Mapping Single Palm-Trees Species in Forest Environments with a Deep Convolutional Neural Network

Faculty of Engineering, Architecture, and Urbanism and Geography, Federal University of Mato Grosso do Sul (UFMS), Avenida Costa e Silva, 79070-900, Campo Grande, Mato Grosso do Sul, Brazil
Post-Graduate Program in Environment and Regional Development, University of Western São Paulo (UNOESTE), Rodovia Raposo Tavares, km 572-Limoeiro, 19067-175, Presidente Prudente, São Paulo, Brazil
Faculty of Engineering and Architecture and Urbanism, University of Western São Paulo (UNOESTE), Rodovia Raposo Tavares, km 572-Limoeiro, 19067-175, Presidente Prudente, São Paulo, Brazil
Department of Geography and Environmental Management, University of Waterloo (UW), Waterloo, ON N2L 3G1, Canada

Accurately mapping individual tree species in densely forested environments is crucial to forest inventory. When only RGB images are considered, this is a challenging task for many automatic photogrammetry processes. The main reason is the spectral similarity between species in RGB scenes, which can hinder most automatic methods. State-of-the-art deep learning methods could be capable of identifying tree species in RGB images with attractive cost, accuracy, and computational load. This paper presents a deep learning-based approach to detect an important multi-use species of palm tree (Mauritia flexuosa; i.e., Buriti) on aerial RGB imagery. In South America, this palm tree is essential for many indigenous and local communities because of its characteristics. The species is also a valuable indicator of water resources, which is an additional benefit of mapping its location. The method is based on a Convolutional Neural Network (CNN) to identify and geolocate single tree species in a high-complexity forest environment, and considers the likelihood of every pixel in the image being recognized as a possible tree by implementing confidence map feature extraction. This study compares the performance of the proposed method against state-of-the-art object detection networks. For this, images from a dataset composed of 1,394 airborne scenes, in which 5,334 palm-trees were manually labeled, were used. The results returned a mean absolute error (MAE) of 0.75 trees and an F1-measure of 86.9%. These results are better than both Faster R-CNN and RetinaNet under equal experimental conditions. The proposed network provided a fast solution for detecting the palm trees, with a detection time of 0.073 seconds per image.

Introduction
The unplanned development and land occupation of both urban and rural areas are the main drivers of deforestation, contributing to environmental degradation in riparian zones and modifying the natural landscape. Multidisciplinary research is necessary to ascertain the population of vegetative species, identifying their locations and distribution patterns. Such information is essential for the management and conservation of vulnerable ecosystems, and mapping these environments may help governmental entities to control or mitigate environmental damage properly. In the last decade, remote sensing data have been widely applied for monitoring vegetation health (1), biomass (2), forest management (3), and biodiversity (4), among others (5)(6)(7)(8)(9)(10). Thus, remote sensing approaches have been used to investigate areas with difficult terrain access, demonstrating great potential for the classification and detection of forest vegetation. Remote sensing platforms can be equipped with different sensors such as RGB (Red-Green-Blue), multispectral, hyperspectral, LiDAR (Light Detection and Ranging), and others (11). The identification of arboreous vegetation with remote sensing data depends on the spatial and spectral resolutions (12). LiDAR sensors can produce accurate data on the height of trees, which is an excellent variable to be adopted by automatic extraction methods in forest environments (13,14). Multispectral and hyperspectral sensors have the advantage of recording the spectral divergence of the vegetation, which is important for enhancing differences between species configurations, health status, etc. (10,15,16). Still, in recent years, high-spatial-resolution images acquired by RGB sensors have been used in many studies for vegetation identification (9,(17)(18)(19)(20). These sensors have a relatively low cost in comparison with the others but offer limited information regarding the spectral range.
For single-tree species mapping, the literature has already investigated different methods by evaluating multispectral and hyperspectral data (21)(22)(23)(24)(25)(26), airborne LiDAR point clouds (27), and multi-sensor data fusion (28)(29)(30). (31) were able to classify tree species in a temperate forest using satellite multispectral imagery. (32) evaluated UAV (Unmanned Aerial Vehicle)-based multispectral images to map deciduous vegetation. (33) used hyperspectral data to detect boreal tree species at the pixel level, achieving high accuracy for forest mapping. Most of these studies were conducted with hyperspectral and LiDAR sensors. However, the cost and processing demand of both hyperspectral and LiDAR data are unattractive for rapid decision models. This is different from RGB sensors, which have a lower cost and are highly available. (34) demonstrated that the majority of recent applications implement RGB imagery data in the vegetation detection scenario. The visual inspection of remote sensing imagery is a time-consuming, labor-intensive, and biased task. Therefore, various studies have developed methods for the automated extraction of vegetation features (10,15,22). Accurately mapping individual tree species in densely forested environments is still a challenging task, even for more robust methods. The increase in quality and quantity of remote sensing data, alongside the rapid improvement of technological resources, allowed for the development of intelligent methods in the computer vision community. By combining remote sensing data with artificial intelligence techniques, it is possible to properly map tree species and improve accuracy in applications regarding vegetation monitoring. In recent years, multiple frameworks have been implemented to assess the performance of such algorithms on this task (2,5,15,(35)(36)(37).
During the past years, the detection and extraction of trees in high-resolution imagery were performed with more traditional machine learning algorithms, like support vector machines (SVM), random forests (RF), artificial neural networks (ANN), and others (38)(39)(40)(41). They returned interesting outcomes in plenty of studies regarding vegetation analysis (10,(42)(43)(44)(45)(46). However, these learners (known as shallow learners) are limited by data complexity and may return lower accuracy in comparison with deep learning methods. When considering adverse conditions in a given forest dataset, deeper methods are required. Identifying individual species in a scene can be a challenging task (23). However, state-of-the-art deep learning-based methods should be capable of identifying single tree species with an attractive accuracy and computational load even in RGB images. Recently, deep learning-based methods have been implemented in multiple remote sensing tasks, specifically for image segmentation, classification, and object detection approaches (37,(47)(48)(49). Deep learning techniques are among the most recently adopted approaches to process remote sensing data (50)(51)(52). In a general sense, deep learning can return better performance than shallow learners, especially in the presence of large quantities of data or when the input data is highly complex (53,54). In heavily dense forested environments, the identification of single-tree species can become a challenge even for robust methods like current state-of-the-art deep networks. This motivated several studies to evaluate the performance of deep learning methods on this task. Recently published research tested the performance of object detection using deep networks like YOLOv3 (55), RetinaNet (56), and Faster R-CNN (57) to detect tree canopy in RGB imagery covering an urban area (9). Another study modified the VGG16 (58) to monitor the health conditions of vegetation (59).
A combination of LiDAR and RGB images was also used with RetinaNet to identify tree-crowns in UAV images (19). The DenseNet (60) was also implemented with multispectral data to classify tree species. The spatial and spectral divergences between tree and non-tree pixels are essential for automatic methods (15,16). In highly dense scenarios like heavily forested areas, the individual detection of trees can be difficult. RGB sensors are not capable of providing the same amount of spectral data as multispectral or hyperspectral sensors, which poses a potential hindrance for automatic extraction methods. Nonetheless, state-of-the-art deep learning methods based on confidence maps, instead of object detection approaches, could be capable of identifying single tree species in highly dense areas using RGB images. Methods that can accurately identify a species among others may help optimize several applications in environmental planning and forest management. In the presented context, this paper presents a deep learning approach to detect an individual fruit species of palm tree (Mauritia flexuosa L.f.; Buriti) with only aerial RGB orthoimages. As a contribution of this approach, a framework to identify and geolocate a single species in a highly complex forested environment is demonstrated. The study compares the performance of the proposed method with other state-of-the-art object detection deep neural networks to test its robustness. The palm tree M. flexuosa is a valuable source of food, remedy, fiber, and light wood for both indigenous communities and local populations (61)(62)(63). It is also considered a native species of the Brazilian flora with both current and potential high economic value (62,64). Besides, this species has ecological importance, constituting a food source, nest site, and habitat for a wide variety of species, and provides multiple ecosystem services (63,65,66), which highlights the need to accurately map it.

Materials and Method
The method proposed in this paper is composed of three main phases (see Figure 1): (1) The dataset was composed of aerial RGB orthoimages obtained from a riparian zone of a region well known to be populated by M. flexuosa palm-trees. With specialist assistance, the palm-trees in the RGB images were visually identified and labeled in a Geographical Information System (GIS). The image and labeled data were split into training, validation, and testing subsets; (2) The detection approach was applied in a computational environment; (3) The performance of the proposed method was compared with other networks.

Study Area and Mapped Species.
The riparian zone of the upper stream of the Imbiruçu brook, located near the city of Campo Grande, in the state of Mato Grosso do Sul, Brazil, was selected for the study (Figure 2). This stream, formed by a dendritic drainage system, is inserted in the hydrographical basin of the Paraguay River and covered by the Cerrado (Brazilian Savanna) biome. The area comprises a highly complex forest patch containing a widespread population of the palm-tree species Mauritia flexuosa (popularly named Buriti). This Arecaceae is a dioecious palm (67) that grows naturally in flooded areas, providing water balance for rivers and other water bodies. In highly dense, monodominant stands in flooded areas, mature M. flexuosa palm trees reach 20 m in height (67). The site evaluated in our experiment, specifically, is one of the well-known locations where the number of samples of this species is large enough to train a deep neural network. The aerial RGB orthoimages were provided by the city hall of Campo Grande, State of Mato Grosso do Sul, Brazil. The ground sample distance (GSD) of the orthoimages is 10 cm. A total of 43 orthoimages with dimensions of 5,619 x 5,946 pixels were used in the study. This aerial image dataset was composed of 1,394 scenes, in which 5,334 palm-trees were manually labeled and used as ground-truth (Figure 3).
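As a quick sanity check on these figures, the ground coverage implied by the 10 cm GSD can be worked out directly. The 256-pixel patch size comes from the experimental setup described in this paper; the snippet below is an illustrative sketch, not part of the original processing chain:

```python
# Ground coverage of the orthoimages and patches at the stated 10 cm GSD.
gsd_m = 0.10                   # ground sample distance, metres per pixel
img_w, img_h = 5619, 5946      # orthoimage size in pixels
patch = 256                    # patch size in pixels

coverage_w_m = img_w * gsd_m   # width on the ground, ~561.9 m
coverage_h_m = img_h * gsd_m   # height on the ground, ~594.6 m
patch_m = patch * gsd_m        # 25.6 m per patch side

# Number of full, non-overlapping 256 x 256 patches per orthoimage.
patches_per_image = (img_w // patch) * (img_h // patch)
```

At this resolution, each patch covers 25.6 m x 25.6 m on the ground, comfortably larger than the crown of a single mature M. flexuosa palm.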
Proposed Method. This study proposes a CNN method that takes the RGB image as input and, through a confidence map refinement, returns a prediction map with tree locations (Figure 4). The objects' positions are calculated after a 2D confidence map estimation, based on previous works (68). The first step of the architecture extracts the feature map. In a sequential step, the feature map goes through the Pyramid Pooling Module (PPM) (69). The last step of the architecture produces a confidence map in a Multi-Stage Module (MSM) that enhances the position of the tree by adjusting the prediction to its center.

Feature Map Extraction and PPM.
For the feature map extraction (Figure 4(b)), the proposed CNN is based on the VGG-19 (58). Here, the network is composed of 8 convolutional layers with 64, 128, and 256 filters with a 3 x 3 window, with Rectified Linear Unit (ReLU) functions in all layers. The spatial volume size is halved after the second and fourth layers by a max-pooling layer (2 x 2 window). The PPM (69) was used (Figure 4(c)) to extract global and local information, which helps the CNN to be less sensitive to tree scale differences. The extracted features are upsampled to a size equivalent to the input feature map and concatenated with it to create an enhanced feature representation.
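The internal layout of the PPM is only summarized above. The NumPy sketch below illustrates the pooling-and-upsampling idea; the PSPNet-style bin sizes (1, 2, 3, 6) and nearest-neighbour upsampling are choices of this illustration rather than confirmed details of the proposed network:

```python
import numpy as np

def pyramid_pooling(features, bins=(1, 2, 3, 6)):
    """Minimal sketch of a Pyramid Pooling Module.

    For each bin size b, the (C, H, W) feature map is average-pooled
    to a b x b grid, upsampled back to (H, W) by nearest neighbour,
    and concatenated with the original features along the channels.
    """
    c, h, w = features.shape
    pooled_maps = [features]
    for b in bins:
        # Average-pool into a b x b grid of roughly equal cells.
        pooled = np.zeros((c, b, b))
        ys = np.linspace(0, h, b + 1, dtype=int)
        xs = np.linspace(0, w, b + 1, dtype=int)
        for i in range(b):
            for j in range(b):
                cell = features[:, ys[i]:ys[i + 1], xs[j]:xs[j + 1]]
                pooled[:, i, j] = cell.mean(axis=(1, 2))
        # Nearest-neighbour upsample back to the input resolution.
        row_idx = np.repeat(np.arange(b), -(-h // b))[:h]
        col_idx = np.repeat(np.arange(b), -(-w // b))[:w]
        pooled_maps.append(pooled[:, row_idx, :][:, :, col_idx])
    return np.concatenate(pooled_maps, axis=0)
```

The coarse bins carry global context (large, isolated crowns) while the fine bins keep local detail, which is what makes the network less sensitive to scale differences between trees.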
Tree Localization. The MSM step (Figure 4(d)) estimates the confidence map from the feature map extracted in the previous module. The MSM is composed of T refinement stages, where the first stage contains 3 layers of 128 filters with a 3 x 3 size, 1 layer with 512 filters of 1 x 1 size, and one final layer with 1 filter that generates the confidence map C1 of the first stage. The positions of the trees predicted in the first stage are refined over the remaining T - 1 stages. In each stage t ∈ {2, 3, ..., T}, the prediction C(t-1) returned from the previous stage is concatenated with the feature map from the PPM module. The final layer in this step has a sigmoid activation function, since the method considers the probability of a tree occurring or not [0,1]. The concatenation process allows both global and local context information to be incorporated. At the end of each stage, a loss function (1) is adopted to avoid the vanishing gradient problem:

f_t = Σ_p ||Ĉ_t(p) - C_t(p)||²,  (1)

and the general loss is computed by summing over all stages, as in equation (2):

f = Σ_{t=1..T} f_t,  (2)

where C_t(p) is the ground-truth confidence map at location p in stage t, and Ĉ_t(p) is the confidence map predicted by that stage.
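This intermediate supervision can be sketched as below, assuming an L2 objective between predicted and ground-truth confidence maps at each stage, summed over stages; the exact loss form is an assumption drawn from related confidence-map estimation work:

```python
import numpy as np

def stage_loss(pred_map, gt_map):
    # L2 distance between the predicted and ground-truth confidence
    # maps of a single refinement stage.
    return float(np.sum((pred_map - gt_map) ** 2))

def total_loss(stage_preds, gt_map):
    # Intermediate supervision: per-stage losses are summed, so every
    # refinement stage receives a direct gradient signal, which is what
    # counteracts the vanishing gradient problem in deep stacks.
    return sum(stage_loss(p, gt_map) for p in stage_preds)
```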
The confidence map is generated by a 2D Gaussian kernel centered at each labeled tree. A standard deviation σt controls the spread of the peak of each Gaussian kernel (Figure 5). In early phases of the experiment, different values of σt were adopted to improve the robustness of the method. Finally, the tree locations are estimated from the peaks of the confidence map (Figure 5). These peaks are the local maxima of the confidence map, representing a high probability of tree occurrence. A position p = (xp, yp) is considered a local maximum if CT(p) > CT(v) for all neighbors v, where v is given by (xp ± 1, yp) or (xp, yp ± 1). A peak in the confidence map is defined as a real tree if CT(p) > τ. To prevent the network from confusing trees in a nearby range of each other, a minimum distance δ between peaks is also enforced. For this study, δ equal to 1 pixel and τ equal to 0.35 were defined as valid values during a previous experimental phase.
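The two operations above, rendering the ground-truth Gaussian confidence map and extracting peaks as local maxima over the 4-connected neighbourhood, can be sketched in NumPy as follows. This is a simplified illustration using the 0.35 threshold; the minimum-distance filtering between peaks is omitted:

```python
import numpy as np

def gaussian_confidence_map(shape, centers, sigma):
    """Ground-truth map: a 2D Gaussian kernel at each labeled tree centre."""
    h, w = shape
    yy, xx = np.mgrid[0:h, 0:w]
    cmap = np.zeros(shape)
    for (cy, cx) in centers:
        g = np.exp(-((yy - cy) ** 2 + (xx - cx) ** 2) / (2.0 * sigma ** 2))
        cmap = np.maximum(cmap, g)  # keep the strongest peak where kernels overlap
    return cmap

def find_peaks(cmap, threshold=0.35):
    """Local maxima of the confidence map above `threshold` (4-neighbourhood)."""
    peaks = []
    h, w = cmap.shape
    for y in range(h):
        for x in range(w):
            v = cmap[y, x]
            if v <= threshold:
                continue
            neighbours = [cmap[ny, nx]
                          for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1))
                          if 0 <= ny < h and 0 <= nx < w]
            if all(v > n for n in neighbours):
                peaks.append((y, x))
    return peaks
```

Because each Gaussian decreases monotonically away from its centre, only the labeled centres survive both the threshold and the local-maximum test, so the peak list recovers the tree positions.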
Experimental Setup. For the experimental setup, the RGB orthomosaics were separated into training, validation, and testing regions, respectively (Figure 6). They were split into non-overlapping patches of 256 x 256 pixels (25.6 m x 25.6 m). The patches were then divided into training (42.3%), validation (34.5%), and testing (23.2%) sets. Table 1 lists the number of samples (trees) and image patches, and Figure 6 displays examples of the orthomosaics used to extract the datasets. For the training process, the CNN was initialized with pre-trained weights from ImageNet, and a Stochastic Gradient Descent optimizer was applied with a momentum equal to 0.9. The validation set was used to adjust the learning rate and the number of epochs, which were set to 0.001 and 100, respectively. The performance of the proposed network was assessed with the following metrics: mean absolute error (MAE), precision (P), recall (R), and F1-measure (F1). The results were compared with the Faster R-CNN and RetinaNet methods. Since these methods are based on bounding boxes, the plant position (x, y) from the labeled ground truth was used as the center of the box, and the size of the box corresponds to the area occupied by the tree canopy. To perform this comparison, the same training, validation, and testing sets were adopted for the three methods. Likewise, an inverse process was applied during the test phase, where the position of the tree was obtained from the center of the bounding box predicted by RetinaNet and Faster R-CNN. This allowed applying the same metrics (MAE, P, R, and F1) to compare the performances of the neural networks.
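The point-based evaluation can be sketched as below. The greedy nearest-neighbour matching and the per-image absolute count error are assumptions of this illustration; the text does not specify exactly how predictions and ground-truth points are paired:

```python
import numpy as np

def evaluate_detections(pred_points, gt_points, max_dist=1.0):
    """Greedy matching of predicted to ground-truth tree positions.

    A prediction counts as a true positive if it lies within `max_dist`
    pixels of an as-yet-unmatched ground-truth point. Returns precision,
    recall, F1, and the absolute count error (the per-image term of MAE).
    """
    matched_gt = set()
    tp = 0
    for p in pred_points:
        best, best_d = None, max_dist
        for i, g in enumerate(gt_points):
            if i in matched_gt:
                continue
            d = np.hypot(p[0] - g[0], p[1] - g[1])
            if d <= best_d:
                best, best_d = i, d
        if best is not None:
            matched_gt.add(best)
            tp += 1
    fp = len(pred_points) - tp   # detections with no nearby ground truth
    fn = len(gt_points) - tp     # ground-truth trees that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    abs_err = abs(len(pred_points) - len(gt_points))
    return precision, recall, f1, abs_err
```

Averaging `abs_err` over all test images gives the MAE reported in the results, while precision, recall, and F1 are accumulated from the matched and unmatched points.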

Results
Validation of the Parameters. The proposed approach parameters σmin, σmax, and the number of stages T are responsible for refining the prediction map. Initially, the influence of these parameters was evaluated on the M. flexuosa palm-tree validation set. Table 2 shows the evaluation of the number of stages T used in the MSM refinement phase. In this experiment, σmin = 1 and σmax = 4 were set and T was varied from 1 to 5; T = 4 achieved the best performance among the analyzed numbers of stages, reaching an MAE of 0.852 trees and an F1-measure of 87.1%.

Table 2. Influence of the number of stages (T) in the counting and detection of M. flexuosa palm-trees (σmin = 1 and σmax = 4 were adopted).

The values of σmin and σmax applied in the refinement stage were also evaluated. For this, the number of stages T = 4 was adopted in the subsequent steps, since it obtained the best results in the previous experiment (see Table 2). Since the σmin values represent the dispersion of the density maps around the centers of the trees, it was found that smaller values do not correctly cover the trees and, therefore, can impair the detection. On the other hand, higher σmin values are also harmful, as they cover more than one tree per area. Thus, the best results were obtained with σmax = 4, indicating that it fits the characteristics of the M. flexuosa palm-trees better and generates a more accurate refinement map. Table 3 presents the evaluation of different values of σmin, responsible for the last stage of the MSM. For this, σmax = 4 and T = 4 were adopted, since they obtained the best results in the previous experiments (Tables 2 and 3). With σmin = 1, the proposed approach returned the best performance among the analyzed values.

Fig. 6. Training, validation, and testing datasets separated per region.

Therefore, the refinement step implemented with values of σmin = 1, σmax = 4, and T = 4

generated a more accurate refinement for the validation set. The MAE, precision, recall, and F1-measure metrics were calculated for each network, and the results are displayed in Table 5. The proposed approach achieved high precision and a good F1-measure value but returned a slightly lower recall than the compared methods. Nonetheless, it is essential to consider the trade-off between the recall difference (-6.58% from the Faster R-CNN and -12.35% from the RetinaNet) and the precision difference (+14.52% from the Faster R-CNN and +35.49% from the RetinaNet).
Since the F1-measure combines the precision and recall values, it can be assumed that the proposed approach performs better and returns a better balance between precision and recall in the identification of palm-trees. The results are also consistent with recent literature where object detection applications were applied to the identification of single tree species (8,9,68,70), albeit in the non-RGB image domain. The low precision values of the bounding-box methods may be explained by the high density of objects (i.e., M. flexuosa palm-trees). This condition is described as problematic for deep networks based on these characteristics, especially when the boxes have high intersections with similar objects (71). To verify the potential of the proposed approach in real-time processing, a comparison of its performance with other state-of-the-art methods was conducted (Table 6). However, even in the few problematic cases, the proposed approach can correctly detect most of the palm-trees. The visual comparison of the palm-tree detection approaches is shown in Figure 9. Column (a) displays the detections obtained by the proposed method, while Columns (b) and (c) are related to the compared methods: Faster R-CNN and RetinaNet, respectively. The approach that obtained the worst result was RetinaNet (Figure 9(c)), generating many false-positives (red dots) close to the M. flexuosa palm-tree detections. On the other hand, Faster R-CNN (Figure 9(b)), despite having fewer false-positives, did not properly learn the characteristics of the palm-trees and incorrectly detected other tree species among them. Following the quantitative results shown in Table 5, the proposed approach has the highest precision in detecting M. flexuosa palm-trees (Figure 9(a)), while having the smallest number of incorrect detections (false-positives).

Discussion
This study demonstrated a feasible method to automatically map a single palm-tree species, M. flexuosa, using an RGB imagery dataset. Mauritia flexuosa frequently occurs at low elevations, with high density on river banks and lake margins, around water sources, and in inundated or humid areas (64). It is one of the most widely distributed palm trees in South America. In Brazil, the species occurs in the Amazon region, Caatinga, Cerrado, and Pantanal, and it is one of the palm trees most used by humans, being an important item in the diet of many indigenous groups and rural communities (64). Mapping M. flexuosa palm-trees is an important practice for multiple regions of South America, like Brazil, where this plant is viewed as a valuable resource. This palm is widely used for several purposes and is considered a species of multiple uses (62). It occurs in areas of "Veredas", which are protected by the Brazilian forest code, yet there is still a great lack of characterization of the habitats of this species in the country.
Mapping and identifying populations of the M. flexuosa palm is relevant because this species is a reliable indicator of water resources, such as streams inside dense gallery forests, slow-flowing swamp surface water, and shallow groundwater in the Cerrado region, which are vital for humans and wildlife, besides being a valuable source of several non-timber forest products. The proposed approach thus provides useful information for sustainable economic use and conservation.
As described, single tree species identification is a challenging task even for state-of-the-art deep neural networks when only RGB imagery is considered, mainly because forest environments comprise multiple spectral and spatial characteristics: overlapping canopies, leaves and branches, size, growth stages, density, and others. For this reason, studies have considered different data to help solve this issue, like point density information, canopy height, digital terrain and surface models, spectral divergence, etc. (4,15,32,40,54). Regardless, this paper proposes a simplification of this process by adopting little input information (i.e., point labels and RGB imagery) and a robust method that, once trained, can rapidly resolve the said task even in a real-time context. The present approach achieved satisfactory precision (93.5%), recall (84.2%), and F1-measure (86.9%) values, and a small MAE (0.758 trees). Studies that applied deep neural networks for detecting other types of arboreal vegetation returned similar metrics. For the identification of citrus trees, a CNN method was able to provide 96.2% accuracy (16), and in oil palm-tree detection, a deep neural network implementation returned an accuracy of 96.0%. A palm-tree species different from the one evaluated in our dataset was investigated with a modification of the AlexNet CNN architecture, which returned high prediction values (R² = 0.99, with relative errors between 2.6% and 9.2%) (70). A study (9) achieved an accuracy higher than 90% in detecting single tree species using RetinaNet and RGB images. However, in the aforementioned papers, the tree density patterns are different from ours, and the evaluated individual trees are more spaced from each other, which makes for a simpler object detection problem.
In the described manner, the proposed method may help in mapping the M. flexuosa palm-tree with little computational load and high accuracy. Since this approach can use point features as labeled objects, it reduces the amount of labeling work required from the human counterpart. Additionally, the method provided a fast solution to detect the palm trees' locations, with a detection time of 0.073 seconds per image and a standard deviation of 0.002 seconds using a GPU. This information is essential for properly calculating the cost and effectiveness of the method. The presented approach may support new research while providing primary information for exploring environmental management practices in the experiment context (i.e., evaluating a keystone tree species). The proposed method could also be applied to other areas and regions to help detect the M. flexuosa palm tree and contribute to decision-making on conservation measures for the said species.

Conclusions
This paper presented an approach based on deep networks to map a single species of fruit palm tree (Mauritia flexuosa) in aerial RGB imagery. According to the performance assessment, the method returned an MAE of 0.75 trees and an F1-measure of 86.9%. A comparative study also showed that the proposed method returned better accuracy than state-of-the-art methods like Faster R-CNN and RetinaNet under the same experimental conditions. Besides, this approach took a shorter time to detect the palm-trees, with a detection time of 0.073 seconds per image and a standard deviation of 0.002 seconds using the GPU. In future implementations, it should be possible to add new strategies to this CNN architecture to overcome challenges regarding other tree patterns. Still, the identification of individual species can assist in both monitoring and mapping important singular species. As such, the proposed method may support new research in the forest remote sensing community that includes data obtained with RGB sensors.