Super-Resolution of Sentinel-2 Images at 10m Resolution without Reference Images

Sentinel-2 provides multi-spectral optical remote sensing images in the RGBN bands with a spatial resolution of 10m, but the spatial detail is not sufficient for many applications. WorldView provides HR multi-spectral images at resolutions finer than 2m, but it is a commercial paid resource with relatively high usage costs. In this paper, without any available reference images, Sentinel-2 images at 10m resolution are improved to a resolution of 2.5m through deep-learning-based super-resolution (SR). Our model, named DKN-SR-GAN, uses degradation kernel estimation and noise injection to construct a dataset of near-natural low-high-resolution (LHR) image pairs, using only low-resolution (LR) images and no high-resolution (HR) prior information. DKN-SR-GAN uses a Generative Adversarial Network (GAN) combining an ESRGAN-type generator, a PatchGAN-type discriminator and a VGG-19-type feature extractor, optimized with a perceptual loss, so as to obtain SR images with clearer details and better perceptual quality. Experiments demonstrate that, both in the quantitative comparison of non-reference image quality assessment (NR-IQA) metrics such as NIQE, BRISQUE and PIQE and in the intuitive visual quality of the generated images, our proposed model has obvious advantages over state-of-the-art models such as EDSR8-RGB, RCAN and RS-ESRGAN.


Introduction
Satellite remote sensing images have important applications in many fields, such as agriculture, environmental protection, land use, urban planning, natural disasters, hydrology and climate [1]. With the continuous updating of optical instruments and other equipment, the spatial resolution of satellite images is constantly improving. For example, the WorldView-3/4 satellites can collect 8 bands of multi-spectral data with a ground resolution of 1.2m [2]. However, WorldView-3/4 data must be paid for, and covering a large area or performing a multi-temporal analysis is therefore constrained by data cost. Open-access data with acceptable spatial quality can be considered instead, such as Landsat [3] or Sentinel [4]. Sentinel-2 freely updates remote sensing images of every location in the world approximately every 5 days, and these images are becoming an increasingly important resource for applications. Sentinel-2 uses two satellites to achieve global remote sensing coverage at the equator and provides a multi-resolution product composed of 13 spectral bands: 10m resolution images in the 4 bands of visible red (B4), green (B3) and blue (B2) light and the near-infrared (B8), 20m resolution images in 6 bands, and 60m resolution images in the remaining 3 bands [4]. The 10m and 20m resolution bands are usually used for land cover or water mapping, agriculture and forestry, while the lower-resolution 60m bands are mainly used for water vapor monitoring [5]. Due to the open data distribution strategy, the 10m resolution remote sensing images provided by Sentinel-2 are becoming an important resource for many applications. However, such spatial resolution is still slightly insufficient for many of them.
In order to make full use of the freely available Sentinel-2 images and to improve their spatial resolution, inspired by the blind SR model KernelGAN [39] and the blind image denoising model [40], we explicitly estimate the degradation kernel of LHR image pairs of natural images through GAN, estimate the distribution of the degraded noise at the same time, and degrade the 10m resolution images of Sentinel-2 to construct near-natural LHR image datasets. On the basis of these datasets, with reference to the ESRGAN, PatchGAN and VGG-19 network structures, DKN-SR-GAN is designed to implement SR of Sentinel-2 images from 10m to 2.5m.

Dataset
For the convenience of the following analysis, we first present the datasets used for training and testing. The model proposed in this paper is aimed at Sentinel-2 images, so we use the SEN12MS [41] dataset to train and test the models. SEN12MS contains complete multi-spectral information in geocoded images; it includes SAR and multi-spectral images provided by Sentinel-1 and Sentinel-2, together with land cover information obtained by the MODIS system. This paper mainly focuses on the 10m resolution images of the red (B4), green (B3) and blue (B2) bands of the multi-spectral images, namely RGB color images with 10m resolution. SEN12MS gives Sentinel-2 cloudless images of the regions of interest (ROI) at specified time intervals. SEN12MS divides the images into patches of 256x256 pixels, spaced 128 pixels apart so that the overlap rate between adjacent patches is 50%. SEN12MS takes 50% overlap as an ideal compromise between the independence of patches and the maximum number of samples. The SEN12MS dataset obtains randomly sampled ROI based on four seeds (1158, 1868, 1970 and 2017), and the distribution of the ROI is shown in Figure 1. In this paper, DKN-SR-GAN uses one dataset of SEN12MS, named ROIs1158, which is composed of 56 regions of interest across the globe generated from the seed 1158, covering June 1, 2017 to August 31, 2017. ROIs1158 is divided into 56 subsets by region, containing 40883 images of 256x256 pixels in total. This paper randomly selects the subset "ROIs1158_spring_106" as the test dataset (ROI_Te), which contains 784 test images; the remaining 55 subsets, comprising 40099 images, are used as the source image dataset (ROI_Src), and ROI_Src is degraded to generate the LR image dataset (ROI_LR). The source images in ROI_Src are used directly as HR images in training, forming the LHR image pair dataset (ROI_Tr) together with the corresponding images in ROI_LR.
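As an illustration of the 50% overlap patching scheme described above, the following sketch (our own illustration, not the SEN12MS tooling) extracts 256x256 patches at a stride of 128:

```python
import numpy as np

def extract_patches(image, size=256, stride=128):
    """Split an image into size x size patches with the given stride.

    With stride = size // 2, adjacent patches overlap by 50%, the
    compromise SEN12MS uses between patch independence and the
    number of samples.
    """
    h, w = image.shape[:2]
    patches = []
    for top in range(0, h - size + 1, stride):
        for left in range(0, w - size + 1, stride):
            patches.append(image[top:top + size, left:left + size])
    return patches

# A 512x512 tile yields a 3x3 grid of 50%-overlapping 256x256 patches.
tile = np.zeros((512, 512, 3))
print(len(extract_patches(tile)))  # 9
```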
This paper compares the performance of recently proposed models, including EDSR8-RGB [32], RCAN [21] and RS-ESRGAN [33], as well as the traditional BiCubic method [42]. BiCubic directly interpolates ROI_Te without training; RCAN takes the images in ROI_Src as LR images and generates HR images by BiCubic-interpolating each of them to form an LHR image pair dataset; EDSR8-RGB and RS-ESRGAN construct their datasets based on ROI_Src following the schemes proposed in [32] and [33] respectively.

Structure of DKN-SR-GAN
This paper uses DKN-SR-GAN to generate 2.5m resolution images from the 10m resolution source images of Sentinel-2 in two stages. In the first stage, KernelGAN is used to estimate the explicit degradation kernel of the images; then, combined with injected degraded noise, the source images I_src are degraded to LR images I_LR, which are combined with the HR images I_HR (equivalent to I_src) to construct the LHR image pairs (I_LR, I_HR). In the second stage, the dataset (I_LR, I_HR) is used to train the super-resolution generative adversarial network (SR-GAN), which consists of a super-resolution generator (SR-G), a super-resolution discriminator (SR-D) and a super-resolution perceptual feature extractor (SR-F). DKN-SR-GAN is the Sentinel-2 image SR model proposed in this paper, and its structure is shown in Figure 2.

Degradation Kernel Estimation and Noise Injection
Here we introduce an image degradation model based on kernel estimation and noise injection. The natural pairing relationship between low- and high-resolution images can be approximately understood as a degradation relationship between HR images and LR images, and the degradation process can be expressed as:

I_LR = (I_HR * k) ↓s + n   (1)

where k and n represent the degradation kernel and degraded noise respectively, and s represents the scaling factor. The quality of the degradation kernel and degraded noise determines how close the LHR image pairs are to natural image pairs, as well as the accuracy of the mapping features extracted between low- and high-resolution images, which in turn determines the quality of the images generated by SR.
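A minimal NumPy sketch of the degradation process of Equation (1), with an assumed isotropic Gaussian kernel and Gaussian noise purely for illustration (in DKN-SR-GAN both the kernel and the noise are estimated from data, not assumed):

```python
import numpy as np

def gaussian_kernel(size=13, sigma=2.0):
    """Illustrative Gaussian degradation kernel, normalized to sum 1."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx ** 2 + yy ** 2) / (2 * sigma ** 2))
    return k / k.sum()

def degrade(hr, kernel, scale=4, noise_sigma=0.0):
    """I_LR = (I_HR * k) downsampled by s, plus noise n -- Equation (1)."""
    pad = kernel.shape[0] // 2
    padded = np.pad(hr, pad, mode="reflect")
    blurred = np.zeros_like(hr, dtype=float)
    kh, kw = kernel.shape
    for i in range(hr.shape[0]):          # blur: convolve with k
        for j in range(hr.shape[1]):
            blurred[i, j] = np.sum(padded[i:i + kh, j:j + kw] * kernel)
    lr = blurred[::scale, ::scale]                    # downsample by s
    lr = lr + np.random.normal(0, noise_sigma, lr.shape)  # inject noise n
    return lr

hr = np.random.rand(64, 64)
lr = degrade(hr, gaussian_kernel(), scale=4, noise_sigma=0.01)
print(lr.shape)  # (16, 16)
```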

Degradation Kernel Estimation Based on KernelGAN
Here we first consider the noise-free degradation process, assuming that the noise-free LR image I_LR_clean is the result of downsampling the HR image I_HR by the degradation kernel k with the scaling factor s:

I_LR_clean = (I_HR * k) ↓s   (2)

In this paper, KernelGAN is used to estimate the image degradation kernel k. KernelGAN is a blind SR degradation kernel estimation model based on Internal-GAN [43], and a completely unsupervised GAN requiring no training data other than the image itself [39]. KernelGAN trains only on the image I to learn the distribution of its internal pixel patches, with the goal of finding the image-specific degradation kernel, that is, the kernel k that best preserves the distribution of pixel patches across scales of the image I. More specifically, the goal is to "generate" downsampled images whose pixel patch distribution is as close as possible to that of the image I. The essence of the model is to extract the cross-scale recurrence of patches between LR and HR images through deep learning, and the GAN in KernelGAN can be understood as a matching tool for pixel patch distributions. The implementation process of KernelGAN is shown in Figure 3: it trains on a single input image to learn the distribution of the internal pixel patches of the cropped patches. KernelGAN consists of a kernel generator (Kernel-G) and a kernel discriminator (Kernel-D). Both Kernel-G and Kernel-D are fully convolutional, which means the networks are applied to pixel patches rather than to the whole image. Given the input image I, the kernel generator learns to downsample I to I_LR_clean, with the goal of making the discriminator unable to distinguish it from the input image at the pixel patch level. The objective function of KernelGAN is defined as:

G*(I) = argmin_G max_D { E_{x~patches(I)} [ |D(x) − 1| + |D(G(x))| ] + R }   (3)

where G represents the generator and D represents the discriminator, and R is the regularization term that constrains the degradation kernel k:

R = α L_sum_to_1 + β L_boundaries + γ L_sparse + δ L_center   (4)

where L_sum_to_1, L_boundaries, L_sparse and L_center represent losses, and α, β, γ and δ represent constant coefficients.
In this paper, the constant coefficients are set empirically to α = 0.5, β = 0.5, γ = 5, δ = 1. The losses are defined as the following equations respectively:

L_sum_to_1 = | 1 − Σ_{i,j} k_{i,j} |   (5)

L_boundaries = Σ_{i,j} | k_{i,j} · m_{i,j} |   (6)

L_sparse = Σ_{i,j} | k_{i,j} |^{1/2}   (7)

L_center = || (x0, y0) − Σ_{i,j} k_{i,j} · (i, j) / Σ_{i,j} k_{i,j} ||_2   (8)

Preprints (www.preprints.org) | NOT PEER-REVIEWED | Posted: 21 April 2021 doi:10.20944/preprints202104.0556.v1

where k_{i,j} represents the parameter value at each point of the degradation kernel. The goal of L_sum_to_1 is that the sum of {k_{i,j}} is 1. The goal of L_boundaries is to penalize non-zero values near the boundary; m_{i,j} is a constant mask of weights that grows exponentially with the distance from the center of {k_{i,j}}. The goal of L_sparse is the sparsity of k, to avoid an overly smooth kernel. The goal of L_center is to place the center of mass of {k_{i,j}} at the center of the kernel; (x0, y0) represents the indices of the center.
Kernel-G can be regarded as an image downsampling model, which implements linear downsampling mainly through convolution layers; the network contains no nonlinear activation units. A nonlinear generator is not used here because it could produce physically meaningless solutions for the optimization targets, for example an image that is not downsampled but still contains valid pixel patches. In addition, because a single convolution layer cannot converge accurately, we use the multi-layer structure of linear convolution layers shown in Figure 4. The goal of Kernel-D is to learn the distribution of pixel patches in the input image and to distinguish between real patches and fake patches from that distribution. The real patches are cropped from the input image I, while the fake patches are cropped from the I_LR_clean generated by Kernel-G. We use the fully convolutional pixel patch discriminator introduced in [44] to learn the pixel patch distribution of every single image, as shown in Figure 5.
The convolution layers used in Kernel-D perform no pooling operations, so that the network acts on each pixel patch implicitly and finally generates a heat map (D-map), in which each position corresponds to one cropped input patch. The heat map output by Kernel-D represents the likelihood that the pixel patch surrounding each pixel is drawn from the original patch distribution, and it is used to distinguish real patches from fake patches. The loss is defined as the pixel-wise mean square error between the heat map and the label map, where the label map is all 1s for real patches and all 0s for fake patches. After training KernelGAN, we do not keep the generator network itself, but successively convolve the convolution layers of Kernel-G with stride 1 to extract the explicit degradation kernel. Meanwhile, since KernelGAN is trained on a single input image I, each input image yields one degradation kernel, and the many degradation kernels generated from the training image set are randomly selected and used in the subsequent steps. Graphical examples of some degradation kernels are shown in Figure 6.
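Because Kernel-G is purely linear, its stacked convolution layers collapse into one explicit kernel: convolving the layer kernels with one another (stride 1) reproduces the extraction step described above. A minimal sketch of this collapse (the layer sizes below are assumed for illustration):

```python
import numpy as np

def conv_full(a, b):
    """2-D full convolution of two small kernels (pure NumPy)."""
    ha, wa = a.shape
    hb, wb = b.shape
    out = np.zeros((ha + hb - 1, wa + wb - 1))
    for i in range(ha):
        for j in range(wa):
            out[i:i + hb, j:j + wb] += a[i, j] * b
    return out

def collapse_layers(kernels):
    """Sequentially convolve the kernels of Kernel-G's linear conv
    layers to extract the single explicit degradation kernel."""
    k = kernels[0]
    for nxt in kernels[1:]:
        k = conv_full(k, nxt)
    return k

# e.g. three linear layers with 7x7, 5x5 and 3x3 kernels collapse
# into one (7+5+3-2) x (7+5+3-2) = 13x13 explicit kernel.
layers = [np.random.rand(7, 7), np.random.rand(5, 5), np.random.rand(3, 3)]
print(collapse_layers(layers).shape)  # (13, 13)
```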

Generation and Injection of Noise
We explicitly inject noise into the downsampled images I_LR_clean to generate realistic LR images I_LR. In the process of image downsampling, high-frequency information is lost, so the noise distribution changes at the same time. In order to ensure that the degraded images have a noise distribution similar to that of the source images I_src, we extract noise mapping patches directly from the source images in the training dataset. Because patches with rich content have large variance [38], and inspired by [40,45], when extracting noise mapping patches we keep the variance within a specific range under the condition:

Var(n_i) < σ_max   (9)

where Var(·) represents the variance function and σ_max represents the maximum allowed variance. The noise mapping patches are extracted from images randomly selected from ROI_Src, and a certain number of noise patches are extracted to construct the dataset ROI_Noi. The noise mapping patches used in the noise injection process are randomly selected from ROI_Noi.
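The variance-based selection of condition (9) can be sketched as follows (the patch size and threshold are illustrative assumptions, not values from the paper):

```python
import numpy as np

def extract_noise_patches(image, patch=32, stride=32, sigma_max=0.01):
    """Collect patches whose variance stays below sigma_max, per
    condition (9): Var(n_i) < sigma_max.  Smooth, low-variance patches
    are assumed to carry mostly sensor noise rather than scene content."""
    h, w = image.shape[:2]
    patches = []
    for top in range(0, h - patch + 1, stride):
        for left in range(0, w - patch + 1, stride):
            p = image[top:top + patch, left:left + patch]
            if p.var() < sigma_max:
                # subtract the mean so only the noise map remains
                patches.append(p - p.mean())
    return patches

rng = np.random.default_rng(0)
flat = rng.normal(0.5, 0.05, (64, 64))      # low variance: kept
textured = rng.uniform(0, 1, (64, 64))      # high variance: rejected
print(len(extract_noise_patches(flat)), len(extract_noise_patches(textured)))
```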
To sum up, the process of generating the LR images in ROI_LR from the source images in ROI_Src can be expressed as Equation (10), where i and j are randomly selected:

I_LR = (I_src * k_i) ↓s + n_j,  k_i ∈ ROI_Ker,  n_j ∈ ROI_Noi   (10)

SR-GAN
SR-GAN consists of a super-resolution generator (SR-G), a super-resolution discriminator (SR-D) and a perceptual feature extractor (SR-F). SR-G is designed on the basis of the ESRGAN [26] model. Because the ESRGAN discriminator may introduce more artifacts [38], SR-D is designed on the basis of the PatchGAN [44] model. The perceptual feature extractor is designed on the basis of VGG-19 [46], so as to introduce the perceptual loss [47] and enhance the visual effect of the low-frequency features of the images.
The loss L_SR of SR-GAN consists of three parts: the pixel-wise loss L_pix [26], the perceptual loss L_per and the adversarial loss L_adv:

L_SR = λ_pix L_pix + λ_p L_per + λ_a L_adv   (11)

where λ_pix, λ_p and λ_a are constant coefficients, set empirically to λ_pix = 0.01, λ_p = 1, λ_a = 0.005. The losses L_pix, L_per and L_adv are defined in Equations (12), (13) and (16).

L_pix = || G(I_LR) − I_HR ||_1   (12)

The pixel-wise loss L_pix uses the L1 distance to evaluate the pixel-wise content difference between the generated images G(I_LR) and the real images I_HR.

L_per = λ_f L_feat + λ_s L_style   (13)

The perceptual loss L_per evaluates the perceived differences in content and style between images, and consists of the content-related feature reconstruction loss L_feat and the style reconstruction loss L_style, where λ_f and λ_s denote constant coefficients. L_feat and L_style can be expressed as:

L_feat = (1 / (C_l H_l W_l)) || φ_l(G(I_LR)) − φ_l(I_HR) ||²_F   (14)

L_style = || Gram_l(G(I_LR)) − Gram_l(I_HR) ||²_F   (15)

where φ_l(I) represents the feature map obtained at level l of the convolution layers after the image I is input to SR-F, the shape of the feature map is C_l × H_l × W_l (Channel × Height × Width), Gram_l denotes the Gram matrix of φ_l, and ||·||²_F represents the squared Frobenius norm.
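On dummy feature maps, the feature and style terms of the perceptual loss can be sketched as follows (a simplified NumPy illustration of the definitions in [47]; in DKN-SR-GAN the feature maps come from VGG-19 via SR-F):

```python
import numpy as np

def feature_loss(f_gen, f_real):
    """Feature reconstruction loss: squared Frobenius distance between
    feature maps, normalised by C*H*W."""
    c, h, w = f_gen.shape
    return np.sum((f_gen - f_real) ** 2) / (c * h * w)

def gram(f):
    """Gram matrix of a C x H x W feature map, normalised by C*H*W."""
    c, h, w = f.shape
    flat = f.reshape(c, h * w)
    return flat @ flat.T / (c * h * w)

def style_loss(f_gen, f_real):
    """Style reconstruction loss: squared Frobenius distance between
    the Gram matrices of the two feature maps."""
    return np.sum((gram(f_gen) - gram(f_real)) ** 2)

f = np.random.rand(8, 16, 16)  # dummy C x H x W feature map
print(feature_loss(f, f), style_loss(f, f))  # 0.0 0.0
```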
The adversarial loss L_adv is used to enhance the texture details of the generated images to make them look more realistic. The structure of SR-G is shown in Figure 7. Based on the ESRGAN model and adopting the RRDB [26] structure, it is trained on the constructed LHR image pairs (I_LR, I_HR), and the resolution of the generated images is magnified x4.
Because the discriminator in the ESRGAN model may introduce more artifacts, this paper uses a patch discriminator instead of the VGG-128 discriminator of ESRGAN, and SR-D is designed based on the PatchGAN [44] model. The patch discriminator is preferred over VGG-128 for the following reasons: VGG-128 limits the size of the generated images to 128, which makes multi-scale training inconvenient; and VGG-128 uses a fixed fully-connected layer, which makes the discriminator pay more attention to global features and ignore local features [34]. We use a patch discriminator with a fully convolutional structure and a fixed receptive field. Each output value of SR-D is related only to the patches in a local fixed region, so that local details can be optimized; the average of all local errors is used as the final error to guarantee global consistency. The structure of SR-D is shown in Figure 8.
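Each value of the patch discriminator's output map scores one fixed local region, and the spatial size of that map follows directly from the convolution parameters. A small helper illustrates this (the (kernel, stride) list below assumes the nine padding-free layers of Table 1):

```python
def conv_out(size, kernel, stride, padding=0):
    """Spatial output size of one convolution layer."""
    return (size + 2 * padding - kernel) // stride + 1

def dmap_size(size, layers):
    """Propagate an input size through the discriminator's conv layers
    to get the side length of the output D-map."""
    for kernel, stride in layers:
        size = conv_out(size, kernel, stride)
    return size

# (kernel_size, stride) of the nine conv layers, following Table 1
LAYERS = [(3, 1), (3, 1), (4, 2), (4, 2),
          (4, 1), (4, 1), (4, 1), (4, 1), (4, 1)]
# each value of the resulting 14x14 D-map scores one local patch
print(dmap_size(128, LAYERS))  # 14
```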
Based on the VGG-19 [46] model, this paper introduces the perceptual feature extractor to compute the perceptual loss L_per, that is, to extract the features of VGG-19 before activation. The perceptual loss can enhance the low-frequency features of the images and make the images generated by the generator look more realistic. The structure of the perceptual feature extractor is shown in Figure 9.
DKN-SR-GAN first generates the LHR image pair dataset (ROI_Tr) from the training dataset (ROI_Src) for training and testing. We randomly select 2134 images from the 40099 images of ROI_Src to generate a degradation kernel dataset (ROI_Ker) by training KernelGAN on them one by one, namely k_i ∈ ROI_Ker, i ∈ {1, 2, ⋯, 2134}; then we randomly select 4972 images from the 40099 images of ROI_Src and extract noise patches from them one by one to form a noise patch dataset (ROI_Noi), namely n_j ∈ ROI_Noi, j ∈ {1, 2, ⋯, 4972}; finally, we use the degradation kernels and injected noise to degrade the images in ROI_Src one by one. In processing each image, the degradation kernel and injected noise are randomly selected from ROI_Ker and ROI_Noi.
The network structural parameters of Kernel-G and Kernel-D and the constant loss coefficients of KernelGAN have been given above, so we do not repeat them here. In the training phase, both the generator and the discriminator adopt the ADAM optimizer with the parameters β1 = 0.5, β2 = 0.999; the learning rates of the generator and the discriminator are both set to 0.0002, decaying by a factor of 0.1 every 750 iterations, and the network is iteratively trained for 3000 epochs.
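The step-decay schedule described above (learning rate 0.0002, multiplied by 0.1 every 750 iterations) can be written as:

```python
def kernelgan_lr(iteration, base_lr=2e-4, gamma=0.1, step=750):
    """Step-decay schedule: the learning rate is multiplied by `gamma`
    every `step` iterations."""
    return base_lr * gamma ** (iteration // step)

for it in (0, 749, 750, 1500, 2999):
    print(it, kernelgan_lr(it))
```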
SR-G uses the "RRDBNet" model from the "BasicSR" project, and SR-D uses the "NlayerDiscriminator" model from the "Real-SR" project. The network structural parameters and the constant loss coefficients have been given above, so we do not repeat them here. The images are magnified by 4 times; in the training phase, both the generator and the discriminator adopt the ADAM optimizer with the parameters β1 = 0.9, β2 = 0.999, the learning rates of the generator and the discriminator are both set to 0.0001, and the network is iteratively trained for 60,000 epochs.
Many convolutional layers are used in DKN-SR-GAN, and they play a vital role. Extensive testing showed that the parameters of the convolutional layers need to be set to the values shown in Table 1 for DKN-SR-GAN to achieve x4 resolution and obtain the desired image quality.
The EDSR8-RGB, RCAN and RS-ESRGAN models are trained and tested under the BasicSR [48] framework, adopting the parameter settings which have been proven to achieve good results in [21,32,33]; the parameters used in the implementation are detailed in Table 2.

Table 1. Parameters of the convolutional layers:

in_channels  out_channels  kernel_size  stride  padding  dilation  groups  bias  padding_mode
3            64            3            1       0        1         1       True  'zeros'
64           64            3            1       0        1         1       True  'zeros'
64           128           4            2       0        1         1       True  'zeros'
128          128           4            2       0        1         1       True  'zeros'
128          256           4            1       0        1         1       True  'zeros'
256          256           4            1       0        1         1       True  'zeros'
256          512           4            1       0        1         1       True  'zeros'
512          512           4            1       0        1         1       True  'zeros'
512          1             4            1       0        1         1       True  'zeros'

Because the source images used are already the highest-resolution (10m) images of Sentinel-2, there are no real ground-truth images (2.5m resolution) against which the generated images can be compared, so commonly used image quality assessment metrics such as PSNR and SSIM are no longer applicable in this scenario. Therefore, this paper adopts non-reference image quality assessment (NR-IQA) metrics, including NIQE [49], BRISQUE [50] and PIQE [51]. NIQE is a fully blind image quality assessment model; it establishes a "quality aware" statistical feature set based on a simple and effective statistical model of natural scenes in the spatial domain, and it is trained only on measurable deviations from the statistical regularities observed in natural images. BRISQUE is a general non-reference image quality assessment model based on natural scene statistics in the spatial domain; it does not compute distortion-specific features, but uses the scene statistics of locally normalized luminance coefficients to quantify possible losses of "naturalness". PIQE quantifies distortion without any training data, relying on extracted local features to evaluate image quality. All three metrics produce scores in the range [0, 100], where a lower score indicates higher perceptual quality and a higher score indicates lower perceptual quality.
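Both NIQE and BRISQUE build on locally normalized luminance (MSCN) coefficients. The following simplified sketch computes them with a uniform local window in place of the Gaussian window used in [49,50]:

```python
import numpy as np

def mscn(image, window=7, c=1.0):
    """Mean-subtracted contrast-normalised (MSCN) coefficients:
    (I - mu) / (sigma + c), with local mean and std taken over a
    uniform window (the original papers use a Gaussian window)."""
    pad = window // 2
    padded = np.pad(image.astype(float), pad, mode="reflect")
    h, w = image.shape
    mu = np.zeros((h, w))
    var = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            block = padded[i:i + window, j:j + window]
            mu[i, j] = block.mean()
            var[i, j] = block.var()
    return (image - mu) / (np.sqrt(var) + c)

img = np.random.rand(32, 32) * 255
coeffs = mscn(img)
# for natural images the coefficients have roughly zero-mean,
# Gaussian-like statistics; deviations signal distortion
print(round(float(coeffs.mean()), 2))
```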
This paper randomly selects the sub-dataset "ROIs1158_spring_106" in ROIs1158 as the testing dataset (ROI_Te), containing 784 images. The remote sensing images in ROI_Te are collected from the ground areas shown in Figure 10. In the figure, we mark 8 regions with strong geographic features, and the x4 generated images of these regions are listed subsequently to visually compare the differences among the models. The quantitative NR-IQA results are summarized in Table 3. It can be seen from the histograms and Table 3 that our proposed DKN-SR-GAN model is superior to the other models on a variety of non-reference image quality assessment metrics. Figures 14-21 show the generated images of the 8 regions with strong geographic features selected in "ROIs1158_spring_106", to visually compare the differences between the models. Through the comparison of the images of various terrains in Figures 14-21, it can be clearly seen that the images produced by the traditional BiCubic method are the blurriest and smoothest, due to the inherent deficiencies of the interpolation algorithm. The EDSR8-RGB, RCAN and RS-ESRGAN models cannot correctly distinguish noise from sharp edges, resulting in blurred results in which houses and roads can even become indistinguishable. In our DKN-SR-GAN results, the dividing lines between objects and backgrounds, such as roads, bridges and houses, are much clearer, which indicates that the injected noise is closer to the real noise. Compared with the EDSR8-RGB, RCAN and RS-ESRGAN models, our DKN-SR-GAN results are clearer and free of such ambiguity.

Conclusion
In this paper, based on the latest widely recognized GAN technologies such as KernelGAN, ESRGAN and PatchGAN, we introduce degradation kernel estimation and noise injection to perform SR on Sentinel-2 satellite remote sensing images, improving the original images from their highest resolution of 10m to 2.5m. Through the combination of the degradation kernel and injected noise, we obtain LR images in the same domain as the real images and thus near-natural LHR image pairs. On the basis of these near-natural LHR image pairs, we use a GAN combining an ESRGAN-type generator, a PatchGAN-type discriminator and a VGG-19-type feature extractor, use the perceptual loss, and focus on the visual characteristics of the images, so that our results have clearer details and better perceptual quality. Compared with state-of-the-art SR models for Sentinel-2 such as EDSR8-RGB, RCAN and RS-ESRGAN, the main difference of our model lies in the construction of the LHR image pairs for the training datasets. When training with natural LHR image pairs, there is no significant difference in the SR images obtained by these models; however, in the scenario with only LR images and no HR prior information, compared with RCAN, which constructs image pairs through BiCubic, and with EDSR8-RGB and RS-ESRGAN, which use WorldView satellite HR images to construct image pairs, DKN-SR-GAN has obvious advantages both in the quantitative comparison of non-reference image quality assessment metrics and in the intuitive visual effects.

Data Availability Statement: The x4 images produced by the models DKN-SR-GAN, BiCubic, EDSR8-RGB, RCAN and RS-ESRGAN are available online at Baidu Wangpan (code: mbah). All data, models, and code generated or used during the study will be available at GitHub soon.