Convolutional neural networks with deep supervised feature learning for remote sensing scene classification

State-of-the-art remote sensing scene classification methods employ different Convolutional Neural Network architectures for achieving very high classification performance. A trait shared by the majority of these methods is that the class associated with each example is ascertained by examining the activations of the last fully connected layer, and the networks are trained to minimize the cross-entropy between predictions extracted from this layer and ground-truth annotations. In this work, we extend this paradigm by introducing an additional output branch which maps the inputs to low-dimensional representations, effectively extracting additional feature representations of the inputs. The proposed model imposes additional distance constraints on these representations with respect to identified class representatives, in addition to the traditional categorical cross-entropy between predictions and ground-truth. By extending the typical cross-entropy loss function with a distance learning function, our proposed approach achieves significant gains across a wide set of benchmark datasets in terms of classification, while providing additional evidence related to class membership and classification confidence.


Introduction
Classification of satellite and aerial imagery is a quintessential problem in remote sensing observation analysis. Although the variation in appearance and environmental conditions makes this a very challenging problem, in the past few years machine learning methods have led to a dramatic leap in performance [1,2]. In supervised image classification, both for generic imagery and remote sensing scenes, the current gold standard involves designing an appropriate Convolutional Neural Network (CNN) and training the network so that the cross-entropy loss between the predicted and the ground-truth labels is minimized. The cross-entropy is defined as the KL-divergence between the ground-truth labels, encoded as a binary vector where all but one element are zero (one-hot encoding), and the predicted class label distribution from the network, obtained from the last layer after appropriate scaling via a soft-max activation function [3]. Although different loss functions have been proposed for non-classification tasks, such as the focal loss in object detection [4] and the mean-squared error in image enhancement [5], the majority of state-of-the-art CNN architectures employ cross-entropy as the loss function to be minimized in scenarios like supervised image classification [6].
While training a CNN based on the accuracy of the predictions, quantified by the cross-entropy, is a valid and very successful approach, it fails to consider the characteristics of the features extracted by the network. Feature extraction is an integral part of the CNN classification pipeline: unlike earlier state-of-the-art methods that decouple feature extraction and classification, in a CNN the outputs of an intermediate layer, relatively close to the output layer, are typically considered as the features, while the final fully connected layer produces the classification results [7]. As such, while end-to-end training of CNNs, using variants of gradient descent, targets the minimization of the classification error, a wealth of information encoded in these extracted features by and large remains untapped.
Motivated by the power of these "features" to encode useful information, in this work we extend the typical CNN architectural pipeline by proposing an additional output branch that generates low-dimensional embedded features extracted from a given input. The proposed scheme, called CNN with Supervised Feature Learning (CNN-SFL), introduces additional terms in the typical loss function which capture the assumption that features extracted from images of the same class, when appropriately embedded in a low-dimensional space, must be clustered together. In other words, the feature representation of an example should be close to a representative from the same class and far away from representatives of different classes. The main contributions of this paper are summarized as follows:
• We propose the CNN-SFL, a universally applicable scheme that can extend and enhance traditional CNN architectures.
• We demonstrate that the proposed scheme achieves better between-class separation compared to traditional approaches by employing distances to class representatives.
• We provide experimental evidence that the proposed architecture surpasses, in terms of prediction accuracy, existing state-of-the-art methods, through an evaluation on three benchmark datasets.
• Our analysis also indicates that the proposed scheme provides significantly more separable representations between different classes compared to more traditional approaches.
The remainder of the paper is organized as follows. In Section 2, we outline the state-of-the-art in deep learning based remote sensing image classification, focusing on research closely related to the proposed work. In Section 3 we present the proposed CNN-SFL method in detail, while in Section 4 we report the experimental results and comparisons with state-of-the-art methods. The paper concludes in Section 5.

Related work
Scene classification represents a critical problem in the domain of remote sensing, and numerous approaches have been proposed. In the past five years, the majority of methods considered for this problem employ the deep learning framework and most typically rely on Convolutional Neural Network architectures [1,8], employing categorical cross-entropy for the final class prediction. The idea of introducing relationships between features in addition to the cross-entropy was considered in [9], where the authors introduced a distance metric loss between the features extracted in the one-before-last layer of a CNN architecture. The core idea of this work is that, in addition to minimizing the classification error, the network should also produce features that exhibit low intra-class and high inter-class distances.
Kang et al. [10] recently proposed augmenting the typical cross-entropy loss function with an additional term encoding the distances between features extracted through the Scalable Neighborhood Component Analysis (SNCA) method. SNCA is an extension of neighborhood component analysis, a supervised dimensionality reduction method that seeks to learn a metric space such that the leave-one-out KNN score is maximized. In [11], the authors utilized the features extracted from the last convolutional block of a CNN as input to a random forest classifier for remote sensing scene classification. In [12], the authors proposed diversity-promoting deep structural metric learning for remote sensing scene classification. Distance metric learning using CNNs was also employed in [13] for remote sensing image retrieval, where the similarity between a query image and the image dataset is evaluated on the features extracted from a CNN.
In the past few years, different distance metrics have been proposed in the context of Deep Metric Learning [14]. The contrastive loss [15] aims at minimizing the distance between pairs of examples from the same class and maximizing the distance between pairs of examples from different classes through the introduction of appropriate loss function terms, using Siamese network architectures. The triplet loss [16] considers triplets consisting of a query example, an example from the same class, and an example from a different class. The objective in this case is to minimize the distance between examples from the same class while maximizing the distance between examples from different classes. In [17], K. Sohn proposed the N-pair loss, which extends the triplet loss by considering N-1 negative examples simultaneously. The angular loss [18] considers the cosine similarity between examples and offers scale and rotation invariance.
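As a concrete illustration, the triplet objective described above can be sketched in a few lines of NumPy; the margin value and the toy embeddings below are illustrative assumptions, not values from the cited works.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss: push the anchor-positive distance to be
    smaller than the anchor-negative distance by at least `margin`."""
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    return max(0.0, d_pos - d_neg + margin)

# Toy embeddings: the positive lies close to the anchor, the negative far away.
a = np.array([1.0, 0.0])
p = np.array([0.9, 0.1])
n = np.array([-1.0, 0.0])
loss = triplet_loss(a, p, n)   # triplet already satisfied, so the loss is zero
```

When the triplet is well ordered the hinge clips the loss to zero; swapping the roles of the positive and negative examples yields a positive penalty.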

Overall CNN-SFL design
The proposed CNN-SFL is a universal extension of typical CNN architectures for image classification where, instead of a single output, the network provides two outputs: one corresponding to the one-hot vector encoding the class predictions, and another providing the low-dimensional features. The key idea behind the CNN-SFL framework is that traditional CNN architectures, which seek to minimize the disparity between predicted and ground-truth classes, can be augmented with an additional constraint enforcing similarity between the feature representation of each example and the feature-space representatives corresponding to the same and to different classes. An illustrative block diagram of the proposed CNN-SFL framework is shown in Figure 1, which showcases the three components of the composite loss function, namely L_1, L_2, and L_3, representing the classification error, the distance between each example and its same-class representative, and the distance to the different-class representative, respectively.

CNN-SFL network architecture
Let X = {x_i | i = 1, 2, ..., N} be a set of N training examples and Y = {y_i | i = 1, 2, ..., N} the corresponding classes associated with each example. In our model, we assume that labels are encoded in a one-hot representation, such that each label y_i ∈ R^c is a vector with c elements, equal to the number of classes, where every element is 0 except the one associated with the class of example x_i, which is equal to 1. Furthermore, for each example we associate the features extracted from the network, given by Z = {z_i | i = 1, 2, ..., N}, and we denote by C = {c_i | i = 1, 2, ..., c} the representatives of the feature representation of each class.

Let L be the number of layers in the base CNN architecture, i.e., the backbone network. Highly successful architectures which can be utilized as base networks include the ResNet [19] and the Inception [20] architectures. In these reference architectures, the network consists of L - 1 convolutional blocks, while the L-th layer is a fully connected (or dense) layer with a number of elements equal to the number of classes. In the specific case of the ResNet, the network consists of a series of convolutional blocks where the input goes through two paths. One path simply propagates the input, while the other consists of a convolution, a batch normalization, a ReLU activation, another convolution, and a batch normalization. At the end of the block, these two paths are added together and the output goes through a last ReLU activation. In the case of the Inception, each convolutional block consists of multiple independent paths which are all concatenated to produce the block output. Exemplary paths include a cascade of convolutions with different kernel sizes, e.g., a 1x1 convolution followed by a 3x3 and then by another 3x3, or a max pooling layer followed by a 1x1 convolution.
For both ResNet and Inception, as well as many other state-of-the-art architectures, once the convolutional blocks are computed, the output is flattened and passed to a fully connected layer whose number of nodes equals the number of distinct classes. The proposed architecture adds an additional fully connected layer L̃ after the (L - 1)-th convolutional block, such that the network is now equipped with two outputs, one from layer L and one from layer L̃. While the output of layer L is the same as in other applications and is responsible for producing class probabilities, layer L̃ is responsible for the feature embedding process, which is utilized by the proposed CNN-SFL in order to estimate the required distances to class representatives.
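The two-output design described above can be sketched as follows. This is a minimal NumPy mock-up in which randomly initialized linear heads stand in for the two fully connected layers; the layer sizes (2048-dimensional backbone features, 30 classes, 128-dimensional embedding) are illustrative assumptions, not the exact configuration of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(v):
    """Numerically stable softmax over a 1-D vector of logits."""
    e = np.exp(v - v.max())
    return e / e.sum()

class TwoHeadNet:
    """Backbone features feed two parallel fully connected heads:
    one producing class probabilities, one a low-dimensional embedding."""
    def __init__(self, in_dim=2048, n_classes=30, embed_dim=128):
        self.W_cls = rng.standard_normal((n_classes, in_dim)) * 0.01
        self.W_emb = rng.standard_normal((embed_dim, in_dim)) * 0.01

    def forward(self, features):
        # `features` stands in for the flattened output of the last conv block.
        probs = softmax(self.W_cls @ features)   # classification output
        z = self.W_emb @ features                # feature embedding output
        return probs, z

net = TwoHeadNet()
probs, z = net.forward(rng.standard_normal(2048))
```

Both heads share the same backbone computation, so the embedding branch adds only one extra matrix of parameters on top of the standard classifier.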

CNN-SFL loss function
The objective of the proposed scheme is to minimize the composite loss function L given by

L = L_1 + λ_1 L_2 + λ_2 L_3,   (1)

where L_1, L_2, and L_3 are the three components of the proposed loss function, while λ_1 and λ_2 are scaling factors controlling the importance of each component. The first component of the loss function, L_1, is the typical cross-entropy between the predicted and the ground-truth class and is given by

L_1 = -(1/N) Σ_{i=1}^{N} y_i^T log(L(x_i)),   (2)

which is minimized when L(x_i) = y_i, i.e., when the predictions, encoded in a one-hot fashion and thus representing probabilities associated with each class, are in agreement with the ground-truth annotations. The other two components of the composite loss function are responsible for encoding the distance between examples and the representatives from the same class, in the case of L_2, and from a different class, in the case of L_3. Formally, for each example we define the positive representative as the representative associated with the class that the example belongs to, and the negative representative as a representative that belongs to a different class. As such, the requirement for L_2 is to minimize the distance between each example and its positive representative, and for L_3 to maximize the distance between each example and its negative representative.

Preprints (www.preprints.org) | NOT PEER-REVIEWED | Posted: 5 August 2020 doi:10.20944/preprints202008.0113.v1
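The cross-entropy component L_1 and its weighted combination with the distance terms can be sketched numerically as follows; the labels, predictions, placeholder values for L_2 and L_3, and the weights λ_1 = λ_2 = 0.5 are all illustrative assumptions.

```python
import numpy as np

def cross_entropy(y_true, y_pred, eps=1e-12):
    """Categorical cross-entropy between one-hot labels and predicted
    class probabilities, averaged over the batch."""
    return -np.mean(np.sum(y_true * np.log(y_pred + eps), axis=1))

# Two examples, three classes (one-hot ground truth).
y_true = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0]])
y_pred = np.array([[0.8, 0.1, 0.1],
                   [0.2, 0.7, 0.1]])
L1 = cross_entropy(y_true, y_pred)

# Composite loss with stand-in values for the distance-based components
# and illustrative weights lambda1 = lambda2 = 0.5.
L2, L3 = 0.1, 0.05
L = L1 + 0.5 * L2 + 0.5 * L3
```

Note that the cross-entropy vanishes exactly when the predicted distribution matches the one-hot ground truth, in line with the minimization condition above.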
In order to enforce a small distance between the feature representation of a given example i and the representation of its same-class (positive) representative, the L_2 loss is given by

L_2 = (1/N) Σ_{i=1}^{N} [ 1 - exp( -||z_i - c_i||^2 / τ ) ],   (3)

where c_i is the representative of the features belonging to the same class as example i. The additional one is introduced so that the loss function approaches its minimum of zero as the feature representations and the class representatives come closer. Correspondingly, the L_3 loss, which is minimized when the distance between the feature representation of an example and that of a representative from a different class is maximized, is given by

L_3 = (1/N) Σ_{i=1}^{N} exp( -||z_i - c_{≠i}||^2 / τ ),   (4)

where c_{≠i} is the closest feature-space representative belonging to a class different from that of example i. The parameter τ, called the temperature, controls the impact that the distance has on the value of the loss function term. In both cases, the similarities between feature-space examples and the corresponding same- and different-class representatives are modeled through a radial basis function kernel, in order to exploit the property that its output, when comparing two vectors, is bounded between 0 and 1.
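A minimal numerical sketch of the two distance terms, assuming the RBF-kernel form implied by the text (a 1 - exp(-d²/τ) pull toward the positive representative and an exp(-d²/τ) push away from the negative one); the embeddings, representatives, and temperature below are illustrative values of our own.

```python
import numpy as np

def l2_term(z, c_pos, tau=1.0):
    """Pull term: approaches 0 as the embedding z nears its
    same-class representative c_pos."""
    return 1.0 - np.exp(-np.sum((z - c_pos) ** 2) / tau)

def l3_term(z, c_neg, tau=1.0):
    """Push term: approaches 0 as z moves away from the closest
    different-class representative c_neg."""
    return np.exp(-np.sum((z - c_neg) ** 2) / tau)

z = np.array([0.1, 0.0])
c_pos = np.array([0.0, 0.0])   # nearby same-class representative
c_neg = np.array([3.0, 0.0])   # distant different-class representative
```

Because the RBF kernel is bounded in [0, 1], both terms are bounded as well, which keeps the distance losses on a comparable scale to the cross-entropy.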

Class representative selection
In order to select a representative for each class in the feature space, one can consider two scenarios. In the first scenario, the representative corresponds to a "novel" example which is generated by taking the mean of the same-class examples in the feature space and is given by

c_j = (1/N_j) Σ_{i : y_i = j} z_i,   (5)

where N_j is the number of examples which belong to class j. Another option is to select one of the training examples as the representative by solving the following minimization problem:

c_j = argmin_{z_k : y_k = j} Σ_{i : y_i = j} ||z_k - z_i||^2.   (6)

In addition to the selection of the representative for each class, equally important is the selection of the corresponding negative representative, which is used in Eq. 4. Again, two scenarios can be considered, depending on whether the same negative representative is selected for all the examples in a class, or independently for each example. In the first scenario, the negative representative is the same for all the examples in the same class and corresponds to the representative that is closest to the positive representative of these examples, but from another class. Specifically, once the class representatives are estimated, the negative representative is found by solving

c_{≠j} = argmin_{c_m : m ≠ j} ||c_m - c_j||^2.   (7)

In the second scenario, the negative representative is estimated for each example independently. The benefit associated with this scenario is that the L_3 loss is adjusted for each particular example, handling the case where some examples from a certain class may be closer to one negative representative while others may be closer to another. In this case, the negative representatives are estimated by solving

c_{≠i} = argmin_{c_m : m ≠ y_i} ||z_i - c_m||^2.   (8)

In this work, we consider the scenarios where class representatives are estimated and not selected, i.e., we consider Eq. 5 and Eq. 7.
We opted for this approach since, in the feature space, the estimated representatives capture the class centroid more realistically and are independent of the particular characteristics of the training set.
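A sketch of this choice — mean-based class representatives (Eq. 5) and, for each class, the closest other-class representative as the shared negative (Eq. 7) — on toy 2-D embeddings; the helper names and values are our own.

```python
import numpy as np

def class_means(Z, y, n_classes):
    """Eq. 5: each class representative is the mean of that class's embeddings."""
    return np.array([Z[y == c].mean(axis=0) for c in range(n_classes)])

def negative_reps(reps):
    """Eq. 7: for each class, the negative representative is the closest
    representative belonging to a different class."""
    neg = []
    for c, r in enumerate(reps):
        d = np.sum((reps - r) ** 2, axis=1)
        d[c] = np.inf                      # exclude the class's own representative
        neg.append(int(np.argmin(d)))
    return neg

Z = np.array([[0.0, 0.0], [0.2, 0.0],    # class 0
              [1.0, 0.0], [1.2, 0.0],    # class 1
              [5.0, 5.0], [5.2, 5.0]])   # class 2
y = np.array([0, 0, 1, 1, 2, 2])
reps = class_means(Z, y, 3)
neg = negative_reps(reps)
```

In this toy configuration classes 0 and 1 are mutual nearest neighbors, so each serves as the other's negative representative.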

Connection to distance metric learning
In the distance metric learning framework, the objective is to estimate a matrix M such that the distance between examples x_i, x_j from the same class is smaller than the distance between examples x_i, x_k from different classes, where the distance is the Mahalanobis distance given by

d_M(x_i, x_j) = sqrt( (x_i - x_j)^T M (x_i - x_j) ).   (9)

In order for the metric to be a proper distance, the matrix M must be positive semi-definite, i.e., all eigenvalues must be non-negative. When the matrix entries are real-valued, M can be decomposed using the Cholesky decomposition such that M = QQ^T, where Q is a lower-triangular matrix [21]. Given this decomposition, the distance in Eq. 9 can be equivalently expressed as

d_M(x_i, x_j) = ||Q^T x_i - Q^T x_j||_2.   (10)

In the proposed CNN-SFL, the feature mapping layer L̃ effectively performs the same function as the matrix Q, albeit through a non-linear mapping instead of the linear mapping applied by Q. As a result, the new distance considered by the CNN-SFL is given by

d(x_i, x_j)^2 = K(z_i, z_i) - 2 K(z_i, z_j) + K(z_j, z_j),   (11)

where z_i = L̃(x_i) and K is the induced kernel associated with the radial basis function used in Eq. 3 and Eq. 4.
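The equivalence between the Mahalanobis distance and its factored form under M = QQ^T can be checked numerically; the positive definite matrix and the vectors below are arbitrary illustrations.

```python
import numpy as np

rng = np.random.default_rng(1)

# An arbitrary positive definite M (A @ A.T plus a small ridge).
A = rng.standard_normal((3, 3))
M = A @ A.T + 1e-6 * np.eye(3)

# Cholesky factor: M = Q @ Q.T with Q lower-triangular.
Q = np.linalg.cholesky(M)

x_i = rng.standard_normal(3)
x_j = rng.standard_normal(3)
diff = x_i - x_j

d_mahalanobis = np.sqrt(diff @ M @ diff)             # quadratic form (Eq. 9)
d_factored = np.linalg.norm(Q.T @ x_i - Q.T @ x_j)   # factored form (Eq. 10)
```

The factored form shows that Mahalanobis metric learning is equivalent to learning a linear map Q^T followed by an ordinary Euclidean distance, which is exactly the role the non-linear embedding head generalizes.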

Comparison with existing approaches
Our work diverges from the approach proposed in [9] in several ways. First, the proposed scheme can be regarded as an instance of multi-task learning where both a one-hot encoded vector and a feature vector are produced by the network. As such, it offers the flexibility of introducing another distinct path towards the feature space, not necessarily constrained by the given architecture. Second, we extend the idea of imposing constraints on the distances, but instead of considering the distances between examples within a batch, we consider distances between class representatives in feature space, extracted across batches, thus providing greater generalization capabilities. Last, unlike different variants of Deep Metric Learning, where the objective is to maximize the similarity between examples from the same class and which are thus applicable to retrieval-type settings, the proposed scheme can naturally extend state-of-the-art cross-entropy-based classification approaches with the distance learning functionality, thus offering the best of both worlds.

During training, the class representatives are estimated as follows:
for j = 1, 2, ..., C do
    Collect {z_k | y_k = j} and estimate c_j using Eq. 5 or Eq. 6
end for
for i = 1, 2, ..., N do
    Estimate c_{≠i} using Eq. 7 or Eq. 8
end for

Datasets and evaluation metrics
In order to understand the behavior of the proposed CNN-SFL scheme, an extensive set of experiments is conducted on three representative datasets of aerial imagery, namely the UC Merced [22], the AID [23], and the NWPU-RESISC45 [2].
The UC Merced dataset contains aerial color (RGB) images of size 256 × 256 with a spatial resolution of 0.3 m from 21 land-use classes including agricultural, airplane, baseball diamond, beach, buildings, chaparral, dense residential, forest, freeway, golf course, harbor, intersection, medium density residential, mobile home park, overpass, parking lot, river, runway, sparse residential, storage tanks, and tennis courts. For each class, the dataset provides a set of 100 images.
The Aerial Image Dataset (AID) is a large-scale collection of 10000 600 × 600 pixel color images with resolutions between 0.5 and 8 m. The images are examples from 30 classes including airport, bare land, baseball field, beach, bridge, center, church, commercial, dense residential, desert, farmland, forest, industrial, meadow, medium residential, mountain, park, parking, playground, pond, port, railway station, resort, river, school, sparse residential, square, stadium, storage tanks, and viaduct. The number of images for each class varies between 220 and 420.

The NWPU-RESISC45 dataset contains 31500 256 × 256 pixel color images at spatial resolutions varying between 0.2 and 30 m. Images are classified into 45 different classes such that each class contains 700 images, while the classes include airplane, airport, baseball diamond, basketball court, beach, bridge, chaparral, church, circular farmland, cloud, commercial area, dense residential, desert, forest, freeway, golf course, ground track field, harbor, industrial area, intersection, island, lake, meadow, medium residential, mobile home park, mountain, overpass, palace, parking lot, railway, railway station, rectangular farmland, river, roundabout, runway, sea ice, ship, snowberg, sparse residential, stadium, storage tank, tennis court, terrace, thermal power station, and wetland.
In the following subsections we investigate the impact that different design choices have on the system's performance. To that end, we consider images from the AID dataset and a realistic scenario where only 5% of the dataset is available for training, in order to better emphasize the impact of different parameters on the network.

Impact of feature dimensions
A first aspect we explore regarding the proposed CNN-SFL model is the quantification of the impact that different sizes of the feature space representation have in terms of prediction accuracy. Figure 2 presents the accuracy on the validation set achieved by the baseline ResNet101 model and by the CNN-SFL with ResNet101 as the core network and feature space dimensions equal to 32, 64, and 128.
An overall comment on the results presented in Figure 2 is that the introduction of the additional feature representation module does not reduce the classification accuracy of the standard core model. Examining the different dimensions of the feature space, the results indicate that the 128-dimensional space can have a significant impact on performance, offering gains of more than 3% compared to the baseline model.

Impact of parameter selection
In order to get a better understanding of the performance of the proposed scheme, we explore the impact of loss function component weighting and impact of temperature parameters. Specifically, in Table 1, we explore the impact of weighting between the L 1 and L 2 given by parameters λ 1 and λ 2 , as well as the temperature parameter τ.
Regarding the impact of λ_1 and λ_2, an interesting observation is that while their impact on the training error is minimal, the performance gains are more pronounced for the testing/validation error. Furthermore, utilizing both loss function components, for intra- and inter-class similarities, leads to the best performance. Regarding the temperature parameter τ, the evidence suggests that its value has a clear impact on performance. In order to gain a deeper understanding of the behavior of the CNN-SFL, Figure 3 visualizes the feature-based class representatives in a matrix format along with the singular value distribution of this matrix. Ideally, one would like the class representatives to be as distinct as possible which, stated more mathematically, can be revealed by quantifying the linear dependencies between them. The matrix visualization, but more importantly the singular values shown in Figure 3, indicate that the representatives of different classes are relatively orthogonal in the 128-dimensional feature space. As a result, mapping a new test example onto this feature space will lead to a representation which is much closer (in terms of projection magnitude) to a given class representative compared to all others.

Visualization in low-dimensional spaces
To better understand the impact of the feature embedding module, we employ the t-SNE manifold learning method [24] in order to project the extracted features into 2D space and allow the visualization of the outputs. For the case of the baseline network, t-SNE maps the 2048-dimensional outputs of the first fully connected layer, which are typically considered as the feature representation, while for the CNN-SFL, the mapping is from the 128-dimensional feature space to the 2D space for visualization. The resulting scatter plots are shown in Figure 4, the top part for the baseline model and the bottom for the proposed CNN-SFL. In addition to visualizing the extracted features, for the case of the CNN-SFL we also plot the 128-dimensional feature representations of the class representatives (black circles).
Examining the scatter plots in Figure 4, we can clearly see that the separation between different classes achieved by the CNN-SFL is significantly better compared to the baseline model. While for the baseline model examples from different classes are highly intermingled, for the case of the CNN-SFL the classes are clearly separated, owing to the introduction of the L_2 loss function (points that are not properly clustered correspond to misclassifications). However, even for samples falling away from their corresponding class region, we observe that there are cases where such samples are still closer to their own class region compared to other class regions. Furthermore, the class representatives are, as expected, centered on the region covered by examples from the corresponding class, indicating that the L_2 loss indeed acts as an attractor.

Distance based class separation
In addition to the classification of images using the cross-entropy loss, as is traditionally done, the proposed CNN-SFL framework is also capable of estimating distances between examples and class representatives. To quantify this behavior, Figure 5 presents histograms of the distances between samples and class representatives from the same and from different classes. Figure 5(a) presents the histogram for AID with 5% training data and Figure 5(b) for UCMD with 80% training data.
In both cases, we observe that distances to the representative of the correct class are in general smaller compared to distances to representatives of the wrong classes. For the case of AID, the two corresponding distributions have some overlap, which is a result of misclassification, while for the case of the UCMD, which achieves an almost perfect accuracy, there is no observable overlap between the two distributions.
Another observation we can make from these plots is that, ideally, all the distances to the correct class representatives should be zero and all the distances to the incorrect ones should attain the maximum value. However, the distributions are not delta functions but exhibit significant spread. Even in the case where almost all examples are correctly classified, there is a portion of samples that have a non-zero distance to the correct class representative. This distance can be utilized as a proxy for the confidence associated with each prediction, where smaller distances indicate a more reliable prediction. Furthermore, by imposing a threshold between the modes of the same-class and different-class distance distributions, we can allow the network to output "unknown class" responses, unlike traditional cross-entropy-driven CNNs, which will always predict a class even if the particular example does not belong to the available ones.
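The rejection mechanism described above can be sketched as follows: if the distance to even the closest class representative exceeds a threshold, the network abstains with an "unknown class" response. The threshold value and the toy embeddings are illustrative assumptions.

```python
import numpy as np

def predict_with_rejection(z, reps, threshold):
    """Return the index of the nearest class representative, or -1
    ("unknown class") if even the nearest one lies beyond `threshold`."""
    d = np.linalg.norm(reps - z, axis=1)
    nearest = int(np.argmin(d))
    return nearest if d[nearest] <= threshold else -1

reps = np.array([[0.0, 0.0], [4.0, 0.0]])   # two class representatives
in_dist = np.array([0.2, 0.1])               # close to class 0
out_dist = np.array([2.0, 3.0])              # far from every representative
```

A sensible threshold would sit between the modes of the same-class and different-class distance histograms of Figure 5.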

Characteristics of misclassifications
Figures 6, 7 and 8 present the confusion matrices for the CNN-SFL predictions on the AID, the UCMD, and the RESISC45 datasets, respectively. For the case of AID, the system incorrectly classifies resorts as parks, which is a reasonable mistake given that both classes are characterized by a significant amount of grassland, and, in a similar spirit, confuses desert with bare land. This is also observed for the case of the UCMD dataset, where misclassifications correspond to confusing medium residential with dense residential and sparse residential. The same observation also applies to the RESISC45 dataset.

Comparison with state-of-the-art
In order to gain a better understanding of the benefits offered by the proposed CNN-SFL method, we provide a detailed comparison with current state-of-the-art methods on the benchmark datasets. Specifically, Table 2 presents the performance for the UCMD dataset, Table 3 for the AID dataset, and Table 4 for the RESISC45 dataset. In all cases, we consider two training set ratios, following the reported results from previous works. In addition to our method, we also consider three highly successful generic CNN architectures.
Overall, in almost all datasets and training set ratio settings, the experimental results indicate that the proposed CNN-SFL achieves top performance, significantly surpassing generic CNN approaches and the recently presented scene-classification-focused methods. In fact, in certain cases, like the UCMD dataset, the accuracy is almost perfect, such that any remaining error can be attributed to human labelling ambiguities, e.g., medium versus dense residential. Furthermore, we observe that the performance gains are more significant in the low-training-ratio regimes, indicating that the proposed method is able to handle limited training sets more efficiently.

Figure 6. Confusion matrix for the AID with 50% training data.

Table 2. Classification accuracy for the UCMD dataset.

Conclusions
In this work, we propose the CNN-SFL, a universally applicable extension of traditional CNN architectures which achieves very low class prediction error by combining the traditional categorical cross-entropy loss with a novel feature-based distance learning loss. A detailed experimental validation for remote sensing scene classification demonstrates that the proposed scheme can surpass state-of-the-art methods, while the additional information related to the distance to class representatives can serve as a proxy for the confidence associated with each prediction.

Table 3. Classification accuracy for the AID dataset.