Deep learning with uncertainty quantification for slum mapping using satellite imagery

Over a billion people live in slums, with poor sanitation, education, property rights and working conditions having a direct impact on current residents and future generations. A key problem in relation to slums is slum mapping. Without delineations of where all slum settlements are, policy makers cannot make informed decisions to benefit those most in need. Satellite images have been used in combination with machine learning models to try to fill the gap in data availability of slum locations. Deep learning has been used on RGB images with some success, but since labelled satellite images of slums are of relatively low quality and the physical/visual manifestation of slums varies significantly within and across countries, it is important to quantify the uncertainty of predictions for reliable application in downstream tasks. Our solution is to train Monte Carlo dropout U-Net models on multispectral 13-band Sentinel-2 images, from which we can calculate pixelwise epistemic (model) and aleatoric (data) uncertainty in our predictions. We trained our model on labelled images of Mumbai and verified our epistemic and aleatoric uncertainty quantification approach using altered models trained on modified datasets. We also used SHAP values to investigate how the different features contribute towards the model's predictions; this showed that certain short-wave infrared and red-edge image bands are powerful features for determining the locations of slums within images. Having created our model with uncertainty quantification, it can in future be applied to downstream tasks, and decision-makers will know where predictions have been made with low uncertainty, giving them greater confidence in its deployment.


Introduction
Globally, nearly one billion people live in slums, and the figure is estimated to double by 2030. Implicit and explicit social and economic constraints on slum residents result in poor quality of life [1]. In Mumbai, one of the two focus cities for our study, ...

The U-Net architecture has seen widespread success in applications to biomedical imaging, geographical mapping, camera feeds from autonomous vehicles and video segmentation [4][5][6][7]. Whilst convolutional models have been used in this area, the U-Net specifically has not seen application to the slum mapping problem we focus on in this paper, shown in Figure 1. To date, convolutional deep learning models have only been applied to RGB very high resolution (VHR) imagery, and only frequentist (i.e. non-Bayesian) approaches have been taken [3,8]. This means that one particular set of model weights is assumed to be optimal, rather than taking a Bayesian approach in which a posterior distribution over possible weights is considered.

[9] performed satellite image segmentation using convolutional neural networks (CNNs) with a FCN-VGG19 architecture. They also experimented with transfer learning by using model weights from pretraining on the ImageNet database. They found that this transfer successfully increased the Intersection over Union (IoU) score for the model, providing significantly higher scores than when transferring using a model pretrained on images from a different slum from the same country. Several transfer-learned FCN-VGG19 models are investigated further by [10]. Their transfer learning was shown to be more effective than not using transfer learning when predicting on other land cover classes like vegetation. However, they found that the best model for predicting a slum class specifically is a FCN-VGG19 using QuickBird VHR data. This model was able to achieve a validation recall of 86% and an IoU of 77%.
We note that calculating either of these metrics requires a threshold for the continuous model output in [0, 1] to be chosen, resulting in one specific classifier, rather than measuring the classifying power of the model overall in a way that allows for more consistent comparison between models. The VHR model used by [8] ... with AUROC or AUPRC. The fact that different metrics have been used does not allow for comparison between models to establish the state-of-the-art model for this task.

It is generally found across all authors that models trained on one slum generalise poorly when evaluated on a different, unseen slum.

This is likely due to a variety of appearances caused by differing building materials, space constraints and geographical features. These cause good model parameters to vary between geographical areas, resulting in poor transferability [12].

This lack of generalisation ability does not allow for reliable applications to downstream tasks that use the model in a broader setting. The rarity of quality training data and the interpretability and transferability of models represent pressing problems that require further work in this area. It has been repeatedly emphasised that understanding the levels of uncertainty is an important but as yet unexplored area of study and represents important work to be done in order to improve slum model interpretability and deployability [8,12]. If, for example, we were to use the mapping to estimate the ...

In this paper, we aimed to teach an algorithm to produce a map of slums and showed that it has good unseen test set performance using AUROC and AUPRC metrics. We verified our epistemic and aleatoric uncertainty quantification approach using altered models trained on modified datasets.

This is the first investigation into either multispectral deep learning models or uncertainty quantification for the slum mapping problem.
... Sustainable Development Goal 11 of inclusive, safe, resilient, and sustainable cities and human settlements [14].

We teach a model f to be good at producing targets from given input images so ...

We use the U-Net architecture as initially proposed by [4]. This model has been shown to be effective at image segmentation tasks in a wide variety of applications. ... Separate pixel classification is important, as emphasised in [3].

We used a normaliser trained on the training data to help the gradient descent process converge more quickly during training.

We use Monte Carlo dropout [16] between every layer of the architecture, which intuitively randomly kills off a small number of weights in the model. Each of these different randomly dropped-out models makes a slightly different prediction, which simulates the process of sampling from the distribution on the weights. Dropout also prevents overfitting during training [17]. Monte Carlo dropout is explained further in the Supplementary Material and [17].

We use 500 dropout models in our approach to ensure that we sample a relatively wide range of models from the distribution. The dropout rate for all of the models was 0.25.

We used the popular Adam optimiser [18] with an initial learning rate of 0.001. The Adam optimiser is adaptive, and so we do not expect results to change significantly when altering this value. We trained our model for 100 epochs with a batch size of 128. We use the common binary cross-entropy loss, as outlined in the Supplementary Material.
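The Monte Carlo dropout sampling described above, together with the epistemic/aleatoric decomposition used in this paper, can be illustrated with a minimal numpy sketch. This is not the paper's U-Net: a single logistic layer over per-pixel band features stands in for the full network, and all names and sizes here are illustrative assumptions. The decomposition follows the standard law-of-total-variance split for a Bernoulli output: epistemic uncertainty is the variance of the predicted probability across dropout samples, and aleatoric uncertainty is the mean of p(1 - p).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the U-Net: one logistic layer over per-pixel features.
# (Illustrative only; the paper's model is a full U-Net with dropout
# between every layer.)
W = rng.normal(size=(13,))           # one weight per spectral band
pixels = rng.normal(size=(100, 13))  # 100 pixels, 13-band features

def mc_dropout_predict(x, W, n_samples=500, rate=0.25, rng=rng):
    """Sample predictions with a fresh dropout mask per forward pass."""
    probs = np.empty((n_samples, x.shape[0]))
    for i in range(n_samples):
        mask = rng.random(W.shape) >= rate        # drop weights w.p. `rate`
        logits = x @ (W * mask) / (1.0 - rate)    # inverted-dropout scaling
        probs[i] = 1.0 / (1.0 + np.exp(-logits))  # sigmoid
    return probs

probs = mc_dropout_predict(pixels, W)             # shape (500, 100)
mean_pred = probs.mean(axis=0)                    # pixelwise slum probability
epistemic = probs.var(axis=0)                     # spread across dropout models
aleatoric = (probs * (1.0 - probs)).mean(axis=0)  # expected Bernoulli variance
```

With 500 samples and a dropout rate of 0.25, as in the paper, each forward pass uses a different thinned model, so the spread of the sampled probabilities approximates the posterior predictive spread.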

174
We use two locations with slum maps in this work. The first is Mumbai (India), where in 2011 PK Das & Associates produced an award-winning map of the city's slums.

The second is Bogota (Colombia), where in 2018 Gram-Hansen produced a slum map of the city and used it to train their own models [8].

We used 10-meter resolution 13-band multispectral imagery from the Sentinel-2 ... Material.

We used 10% of the dataset for validation and 10% of the dataset as a separate test set. The other 80% was used for training. The data was randomly shuffled before the data split was performed.
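The shuffle-then-split procedure described above can be sketched as follows. The tile count and seed are hypothetical; only the 80/10/10 proportions and the random shuffle come from the text.

```python
import numpy as np

rng = np.random.default_rng(42)

n_tiles = 1000                   # hypothetical number of image tiles
idx = rng.permutation(n_tiles)   # shuffle before splitting, as in the text

n_val = n_test = n_tiles // 10   # 10% validation, 10% test
test_idx = idx[:n_test]
val_idx = idx[n_test:n_test + n_val]
train_idx = idx[n_test + n_val:] # remaining 80% for training
```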

The epistemic uncertainty here quantifies the uncertainty called for in [12]. As the dataset is highly imbalanced, with many more not-slum pixels than slum pixels, accuracy alone is not a good measure of the performance of a model. Only about 10% of the pixels in the dataset are slum, so a naive classifier that always assigns pixels to the "not slum" category has an accuracy of 0.9.
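The accuracy trap described above is easy to verify numerically. The array sizes below are illustrative, chosen only to match the roughly 10% slum prevalence stated in the text.

```python
import numpy as np

# About 10% of pixels are slum, as in the dataset described above.
labels = np.zeros(1000, dtype=int)
labels[:100] = 1

naive_pred = np.zeros_like(labels)  # always predicts "not slum"
accuracy = (naive_pred == labels).mean()
print(accuracy)  # 0.9, despite never identifying a single slum pixel
```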

Some authors use intersection over union or individual precision and recall scores at a specific output threshold to measure the quality of their algorithm. AUPRC is typically more informative than AUROC for imbalanced datasets and does not give a false sense of high performance on imbalanced classification tasks [21].
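Both metrics can be computed directly from the ranked scores without choosing a threshold, which is what makes them suitable for model comparison here. The sketch below implements AUROC via the Mann-Whitney rank statistic and AUPRC as average precision; these are standard formulations (e.g. as in scikit-learn's `roc_auc_score` and `average_precision_score`), not code from the paper, and they assume untied scores for simplicity. Note that a random classifier scores 0.5 on AUROC regardless of class balance, but its AUPRC baseline equals the positive-class prevalence, which is why AUPRC avoids a false sense of performance on imbalanced data.

```python
import numpy as np

def auroc(y_true, scores):
    """AUROC via the Mann-Whitney U statistic (no threshold needed)."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores), dtype=float)
    ranks[order] = np.arange(1, len(scores) + 1)
    n_pos = y_true.sum()
    n_neg = len(y_true) - n_pos
    return (ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def auprc(y_true, scores):
    """Area under the precision-recall curve (average precision)."""
    order = np.argsort(-scores)                 # descending by score
    y = y_true[order]
    tp = np.cumsum(y)
    precision = tp / np.arange(1, len(y) + 1)
    # Average precision: precision summed at each recall increment.
    return np.sum(precision * y) / y.sum()

y = np.array([0, 0, 1, 1])
s = np.array([0.1, 0.2, 0.8, 0.9])  # perfectly separates the classes
```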

Both of these metrics measure the overall predictive power of models without requiring a specific classification threshold to be chosen.

Figure: Predictions and uncertainty for different tiles when using the U-Net model. Each row shows the tile, the ground truth label without masking (yellow denotes slum and dark purple denotes not slum), and the Bayesian U-Net prediction along with the aleatoric and epistemic uncertainties in these predictions. Note that we have shown higher-resolution RGB images in the first column, whilst lower-resolution multispectral images were actually used by the model.

We used 10-meter multispectral satellite images from the European Space Agency Sentinel-2 satellite, along with pre-existing high-quality slum maps from [8] and [15], from which we produced pixel-wise labels of where slums are located within the satellite images. ... The multispectral images are at a lower resolution, making the urban cover types difficult to see.

We found that aleatoric uncertainty is typically very high (in fact, some...

... The Half Data model's histogram has much more mass distributed at epistemic uncertainty values above 0.1 than the other two models.
Figure 3b shows us that all three models have essentially the same epistemic uncertainty on the not-slum pixels in the test set. This is because the proportion of pixels in the not-slum category is much higher than in the slum category, and so the Half Data model is still exposed to sufficient data in the not-slum category. The Class Flip model is trained with some pixels incorrectly relabelled from slum to not-slum, hence this Class Flip model has only slightly increased not-slum epistemic uncertainty in the 0 to 0.05 range compared to the other two models.
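Pixelwise uncertainty histograms of the kind compared here can be produced with shared bins so that the model variants are directly comparable. The distributions below are synthetic stand-ins (the Beta parameters and sample sizes are our assumptions, chosen only to mimic "concentrated near zero" versus "shifted higher"); they are not the paper's measured uncertainties.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical per-pixel epistemic uncertainties from two model variants.
standard = rng.beta(1, 30, size=10_000)   # concentrated near zero
half_data = rng.beta(2, 20, size=10_000)  # shifted toward higher values

bins = np.linspace(0.0, 0.5, 26)          # shared bins for fair comparison
h_standard, _ = np.histogram(standard, bins=bins, density=True)
h_half, _ = np.histogram(half_data, bins=bins, density=True)

# Mass above 0.1, the comparison made in the text for the Half Data model.
frac_above = (half_data > 0.1).mean()
```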

Overall, the epistemic uncertainty is substantially higher when using the Half Data model. Using our dropout model, this change is measurable at the pixelwise level by plotting these histograms. Figure 4 shows that there is little change in aleatoric uncertainty between the Standard and Half Data models. Similarly, Figure 4b shows that whilst the aleatoric uncertainty histograms of the three models on not-slum pixels in the test set are similar, again the Class Flip model has the highest mass of the three at higher uncertainty values. We conclude that the Class Flip model has the highest aleatoric uncertainty of the three models. Figure 3 shows that there is little change in epistemic uncertainty between ...

... Bogota. An example of such an application might be slum monitoring, where changes in slum geography are tracked over time to help inform policy decisions.

We experimented with different model variants to measure separately the impact of reduced quality and reduced quantity of training data. We observed the changes in ...

... data rather than census data exists for these areas [2]. We have assumed that we can overcome this difficulty by training models on large amounts of data to get reasonable performance.

We have shown that our model is of high quality: it is able to achieve high levels of performance on unseen test data whilst providing the added benefits of interpretability and uncertainty quantification. This means that our model is appropriate for deployment ... health. We would like to see future papers using this model in these kinds of tasks in order to provide better data for policy makers to inform their decisions.

We think that it is important for researchers to use the same metrics, allowing for model performance comparison to make the top-performing, state-of-the-art model clear. As explained in Section 5.5, the use of AUPRC is much more appropriate for this problem than the metrics used by researchers previously, and a consistent ...

We also think that the question of model transferability brought up by [3], [9] and [12] is an important area of study which remains unsolved.