Comparison of CNNs and ViTs Based Hybrid Models Using Gradient Profile Loss for Classification of Oil Spills in SAR Images

Abstract: Oil spillage over a sea or ocean's surface is a threat to marine and coastal ecosystems. Spaceborne synthetic aperture radar (SAR) data have been used efficiently for the detection of oil spills due to their operational capability in all-day, all-weather conditions. The problem is often modeled as a semantic segmentation task: the images need to be segmented into multiple regions of interest such as sea surface, oil spill, look-alikes, ships and land. Training a classifier for this task is particularly challenging since there is an inherent class imbalance. In this work, we train a convolutional neural network (CNN) with multiple feature extractors for pixel-wise classification, and introduce the use of a new loss function, namely the 'gradient profile' (GP) loss, which is in fact a constituent of the more generic spatial profile loss proposed for image translation problems. For training, testing and performance evaluation, we use a publicly available dataset with selected oil spill events verified by the European Maritime Safety Agency (EMSA). The results show that the proposed CNN, trained with a combination of GP, Jaccard and focal loss functions, can detect oil spills with an intersection over union (IoU) value of 63.95%. The IoU values for the sea surface, look-alikes, ships and land classes are 96.00%, 60.87%, 74.61% and 96.80%, respectively. The mean intersection over union (mIoU) value over all classes is 78.45%, which amounts to a 13% improvement over the state of the art for this dataset. Moreover, we provide an extensive ablation on different CNNs and Vision Transformers (ViTs) based hybrid models to demonstrate the effectiveness of adding GP loss as an additional loss function for training. The results show that GP loss significantly improves the mIoU and F1 scores for CNNs as well as ViTs based hybrid models. GP loss turns out to be a promising loss function in the context of deep learning with SAR images.


Introduction
Oil spills are one of the major causes of sea oil pollution and pose a significant threat to marine and coastal ecosystems. Ship accidents, bilge dumping and offshore oil platforms are the main sources of sea oil pollution [1]. Over the last few decades, spaceborne synthetic aperture radar (SAR) has been widely used for the detection and classification of oil spills and look-alikes. Oil on a sea surface can generally be seen as a dark stretch in SAR images because it dampens the capillary waves and reduces the backscatter [2]. Nevertheless, dark stretches can also occur as a result of natural phenomena such as low wind areas, algae blooms, grease ice, etc. [1], [3]. These are generally called look-alikes, and they add to the complexity of the classification problem. Even a visual inspection may not suffice to separate an oil spill from a look-alike, and an automated algorithm can similarly mistake a look-alike for an oil spill and vice versa.

In this context, deep learning may prove useful. For example, semantic segmentation with deep convolutional neural networks (DCNNs) can be used to assign a class label to every pixel in an image. DCNNs learn complex features from a large amount of data and extract information in a hierarchical manner, which has led to striking success in the field of remote sensing and geo-spatial analysis [4]. Unlike object-based detection methods, semantic segmentation can delimit the boundaries and position of the target of interest accurately, which renders it suitable for processing remotely sensed data [5], [6]. The swath of typical SAR images over a sea may include contextual information such as part of the coastline (land), ship(s), natural sea surface and look-alike(s) besides the oil spill itself [5]. Therefore, in the context of identification of oil spills, a multi-class classification framework is needed. Numerous classification models based on semantic segmentation, including U-Net [7–10] and the DeepLab series [11], have been used for the detection and classification of oil spills. In spite of this, oil spill detection and its discrimination from look-alikes remains a challenging problem, especially when multiple classes have to be trained and tested.

Recently, the authors in [12] proposed a family of convolutional neural networks (CNNs), termed EfficientNetV2. Usually, the training of CNNs requires high-powered computational resources such as GPUs. The EfficientNetV2 family has fewer trainable parameters, which significantly reduces the training time. We intend to use EfficientNetV2 for semantic segmentation based multi-class classification of SAR images and to highlight the choice of GP loss as a promising loss function for training CNNs. In addition, the authors in [13] proposed self-attention models, i.e., Transformers, for language processing applications [14,15]. Compared to CNNs, Transformers have a larger model capacity, but their generalization capability is worse. After the development of Transformers, several attempts have been made to use the power of self-attention for different computer vision tasks [16–18]. With increasing interest in Vision Transformers (ViTs), the authors in [19] combined the advantages of both CNNs and ViTs to propose a new family of hybrid models.
These models are termed CMTs: Convolutional Neural Networks Meet Vision Transformers [19].

Related Work

The advantage of utilizing CNNs over traditional approaches is that they can be trained end-to-end and learn the input-output mapping from examples [21]. This end-to-end training simplifies the task and reduces the human effort needed to define critical thresholds and parameters. Topouzelis et al. [22] utilized two neural networks (shallow and deep) for classification of potential oil spills from look-alikes. The same framework has been utilized in various later studies with SAR imagery [23], [24]. The authors in [25] proposed a method for oil spill detection and classification based on SegNet [26], a deep convolutional neural network for semantic segmentation. The model is applied to SAR images with pre-confirmed oil spills and performs well under high clutter conditions. However, it is limited to classification of SAR images into two classes, i.e., oil spills and look-alikes. The authors in [27] proposed a DCNN for semantic segmentation of SAR images into multiple regions of interest. The deployed model was trained on a publicly available oil spill dataset [28]. An instance-based segmentation model, namely the mask region-based convolutional neural network (Mask R-CNN), is proposed for the detection and segmentation of oil spills and look-alikes in [29]. The results show that the instance-based segmentation model outperforms traditional deep learning models. Krestenitis et al. [30] proposed a DCNN based on the DeepLab architecture [11] for semantic segmentation of SAR images into regions of interest such as sea surface, oil spills, look-alikes, ships and land. The deep learning model was trained on manually annotated SAR images. The authors in [28] provided a comparison of existing CNNs based on semantic segmentation for the detection of oil spills and look-alikes.

Recently, the oil spill detection dataset developed by the authors in [28] has been used in several studies regarding oil spill classification. The authors in [31] developed a two-stage deep learning framework for classification of potential oil spills. The first stage is a 23-layer CNN that classifies patches based on the percentage of oil spill class pixels. The second stage is a U-Net CNN for semantic segmentation of SAR images. Moreover, they used generalized Dice loss for training and evaluated their results on the test dataset using the Dice score. The authors in [32] proposed a feature merge network (FMNet) for semantic segmentation of SAR images. Initially, they utilized a threshold method to extract global features from SAR images. After that, the results from the initial step are used to extract high dimensional features. In the final step, the extracted features are combined with the high dimensional features of the original SAR image. In [33], the authors […]

Training such models is often hampered by a large class imbalance [34]. Typically for oil spill problems, and remote sensing applications in general, the desired class may have fewer samples by several orders of magnitude than the other class(es). To address this concern, cross entropy (CE) loss can be tailored to give priority to the class(es) with fewer samples. However, this can result in noise amplification [35]. Focal loss can be considered an alternative that down-weights well-classified examples [35]. Nevertheless, systematic studies of loss functions suited to this task are currently lacking.
Moreover, for training CNNs, a loss function that considers the spatial relationship over semantically constant regions has not, to the best of our knowledge, been studied.

In this paper, we investigate the performance of different CNNs and ViTs based hybrid architectures for semantic segmentation of SAR images into multiple relevant classes, i.e., sea surface, oil spill, look-alikes, ship and land. Moreover, we introduce the use of a new loss function termed GP loss, which is in fact a constituent of the more generic spatial profile loss proposed for image translation problems [36]. It computes the similarity in gradient space between ground truth and predicted class labels by considering rows and columns as spatial profiles, respectively. For training, testing and performance evaluation, we use a publicly available oil spill detection dataset [28]; sample images are shown in Fig. 1. We use this dataset not only for training the classifiers, but also as a benchmark to compare our results against those published by the developers in [28].

Methodology
The proposed methodology for oil spill detection is based on semantic segmentation of SAR images. Due to the irregular shape and texture of oil slicks, a single label for an entire image is not sufficient to detect potential oil spills. Similarly, other approaches such as object-based detection [39] and assigning multiple labels to a single image [40] do not perform well in the oil spill detection case.

In contrast, semantic segmentation classifies the multiple classes of interest in a single image at pixel level, making it suitable for complex problems like oil spill detection and classification [5], [6]. The U-Net CNN [7] is also used in many remote sensing applications [8–10]. It consists of an encoder (contracting path) and a decoder (expansive path), as shown in Fig. 2. The encoder has a similar structure to a typical CNN: each encoder block consists of two 3×3 convolutional layers, each followed by a rectified linear unit (ReLU), and a maximum pooling layer with kernel size 2×2 and stride 2. At the end of each encoder block, the number of feature channels is doubled to learn increasingly complex features.

The decoder consists of up-sampling and concatenation layers, followed by two 3×3 convolutional layers, each followed by a ReLU. Finally, a 1×1 convolution is used to map the feature channels to the desired number of classes. The encoder part reduces the spatial dimensions of the input SAR image and increases the number of feature channels, while the decoder restores the spatial dimensions to produce a pixel-wise segmentation map.
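To make the encoder-decoder structure concrete, the following is a minimal PyTorch sketch of a U-Net-style layout (an illustrative simplification with two levels; layer names and channel widths are our own choices, not the exact configuration used in the experiments):

```python
import torch
import torch.nn as nn

class DoubleConv(nn.Module):
    """Two 3x3 convolutions, each followed by a ReLU (one U-Net block)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        )
    def forward(self, x):
        return self.block(x)

class TinyUNet(nn.Module):
    """Illustrative two-level U-Net: channels double at each encoder stage."""
    def __init__(self, in_ch=3, n_classes=5):
        super().__init__()
        self.enc1 = DoubleConv(in_ch, 64)
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)    # halves H and W
        self.enc2 = DoubleConv(64, 128)
        self.up = nn.ConvTranspose2d(128, 64, kernel_size=2, stride=2)
        self.dec1 = DoubleConv(128, 64)                      # 128 = 64 (skip) + 64 (up)
        self.head = nn.Conv2d(64, n_classes, kernel_size=1)  # 1x1 conv to class maps

    def forward(self, x):
        e1 = self.enc1(x)                    # encoder features kept for skip connection
        e2 = self.enc2(self.pool(e1))
        d1 = self.dec1(torch.cat([self.up(e2), e1], dim=1))  # concatenate skip features
        return self.head(d1)                 # logits of shape (N, n_classes, H, W)

# Example: a 320x320 SAR patch with 3 channels -> 5-class segmentation logits
logits = TinyUNet()(torch.randn(1, 3, 320, 320))
print(logits.shape)  # torch.Size([1, 5, 320, 320])
```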

CMTs: Convolutional Neural Networks Meet Vision Transformers are a new family of hybrid models proposed by Guo et al. [19]. A CMT has a stem which consists of a single 3×3 convolutional layer with stride 2 and two 3×3 convolutional layers with stride 1. The rest of the network is made of alternating 3×3 convolutional layers with stride 2 and CMT blocks, as shown in Fig. 4. Each CMT block consists of a Local Perception Unit (LPU), a Lightweight Multi-Head Self-Attention (LMHSA) module and an Inverted Residual Feed-Forward Network (IRFFN). The LPU extracts the local information and is defined as follows:

$$\mathrm{LPU}(X) = \mathrm{DWConv}(X) + X,$$

where $X \in \mathbb{R}^{H \times W \times d}$, $H \times W$ represents the dimensions of the input image at the current stage, $d$ represents the dimension of the features, and $\mathrm{DWConv}(\cdot)$ is a depth-wise convolution. For details about the LMHSA and IRFFN modules, the readers are referred to [19]. Combining the aforementioned modules, the CMT block can be defined as follows:

$$X_i' = \mathrm{LPU}(X_{i-1}), \qquad X_i'' = \mathrm{LMHSA}(\mathrm{LN}(X_i')) + X_i', \qquad X_i = \mathrm{IRFFN}(\mathrm{LN}(X_i'')) + X_i'',$$

where $X_i'$ and $X_i''$ are the outputs from the LPU and LMHSA modules for a block $i$, respectively, and $\mathrm{LN}(\cdot)$ denotes layer normalization.

CoAtNets are a family of hybrid models, recently proposed by the authors in [20]. CoAtNets are built on two key insights: (1) depth-wise convolution and self-attention can be naturally unified via simple relative attention; and (2) vertically stacking convolutional layers and attention layers in a principled way improves generalization, capacity and efficiency [20].

Fig. 5. An overview of the CoAtNet architecture used for semantic segmentation of SAR images for oil spill classification. It has five stages, viz., S0, S1, S2, S3 and S4. Each stage reduces the dimensions of the input image by a factor of 1/2. For our classification problem, the input is a 320×320×3 SAR image and the output is a 320×320×5 semantic segmentation mask with the five desired classes.
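The following is a minimal PyTorch sketch of the CMT block structure described above. It is a simplified illustration: the LMHSA and IRFFN internals are stood in for by standard multi-head attention and an MLP, so module names and hyperparameters here are placeholders rather than the exact design of [19]:

```python
import torch
import torch.nn as nn

class LPU(nn.Module):
    """Local Perception Unit: depth-wise 3x3 convolution with a residual connection."""
    def __init__(self, dim):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)
    def forward(self, x):            # x: (N, d, H, W)
        return self.dwconv(x) + x    # LPU(X) = DWConv(X) + X

class CMTBlock(nn.Module):
    """Simplified CMT block: LPU -> attention -> feed-forward, with residuals.
    Plain nn.MultiheadAttention and an MLP stand in for LMHSA and IRFFN."""
    def __init__(self, dim, num_heads=4, mlp_ratio=4):
        super().__init__()
        self.lpu = LPU(dim)
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim), nn.GELU(),
            nn.Linear(mlp_ratio * dim, dim),
        )

    def forward(self, x):                         # x: (N, d, H, W)
        x = self.lpu(x)                           # X'  = LPU(X)
        n, d, h, w = x.shape
        seq = x.flatten(2).transpose(1, 2)        # (N, H*W, d) token sequence
        z = self.norm1(seq)
        seq = self.attn(z, z, z)[0] + seq         # X'' = Attn(LN(X')) + X'
        seq = self.ffn(self.norm2(seq)) + seq     # X   = FFN(LN(X'')) + X''
        return seq.transpose(1, 2).reshape(n, d, h, w)

# Example: one block applied to a 40x40 feature map with 64 channels
y = CMTBlock(64)(torch.randn(1, 64, 40, 40))
print(y.shape)  # torch.Size([1, 64, 40, 40])
```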

Categorical cross entropy loss
Cross entropy is a measure of the difference between two probability distributions. Considering the case of binary classification, the cross entropy loss is expressed as follows [43]:

$$\mathrm{CE}(p, y) = \begin{cases} -\log(p) & \text{if } y = 1, \\ -\log(1-p) & \text{otherwise}, \end{cases} \tag{3}$$

where $y \in \{\pm 1\}$ is the ground truth class and $p \in [0, 1]$ is the predicted probability of the true class. In the context of multi-class classification, this loss is referred to as the categorical cross entropy loss. It measures the performance of a classification model by comparing the probability distributions of ground truths and predicted class labels. If we define a new variable $p_t$:

$$p_t = \begin{cases} p & \text{if } y = 1, \\ 1-p & \text{otherwise}, \end{cases}$$

then eq. (3) can be rewritten as $\mathrm{CE}(p_t) = -\log(p_t)$.

Categorical focal loss

This loss function helps in addressing the data imbalance problem. Hard examples tend to increase the classification error; training a CNN with categorical focal loss encourages the model to pay more attention to these examples, resulting in improved classification performance. It also prevents a large number of false negatives from saturating the CNN during the training phase. Mathematically, the focal loss is defined by adding the modulating factor $(1-p_t)^{\gamma}$ to the cross entropy loss [43]:

$$\mathrm{FL}(p_t) = -\alpha (1-p_t)^{\gamma} \log(p_t),$$

where $\alpha$ and $\gamma$ are the hyperparameters of the focal loss.

Jaccard loss

The Jaccard index is one of the most commonly used metrics for semantic segmentation based classification problems. It measures the similarity between the ground truth mask and the predicted class labels. Considering $y$ to be the ground truth mask and $\hat{y}$ the predicted class labels, the Jaccard loss can be computed as follows [44]:

$$\mathcal{L}_{J}(y, \hat{y}) = 1 - \frac{|y \cap \hat{y}| + \varepsilon}{|y \cup \hat{y}| + \varepsilon},$$

where $\varepsilon$ is used to prevent division by zero. The subtrahend is equivalent to the intersection over union (IoU) value. Therefore, the use of Jaccard loss for training aims to directly increase the IoU (which is itself a commonly used figure of merit for classification performance).

Gradient profile loss

Common cross-entropy based losses used in semantic segmentation focus on classifying each pixel individually and do not take into account the spatial relationship over semantically constant regions. To some extent, the use of an IoU based loss (Jaccard) caters for this, since it tries to increase the intersection over union of the final predictions over a region. To illustrate this point, Fig. 6 shows three images, i.e., source A (left), target B (center) and target C (right). Targets B and C have the same number of white pixels, but their spatial structure is different. First, we compute the mean absolute difference ($D_{pixel}$) between source A and each of the targets B and C by considering each pixel independently. As a result, we get the same value of 0.3750 for both targets: this method does not capture the different spatial patterns of targets B and C. The complex spatial patterns in an image can be better captured by considering pixel variations along a given direction. To demonstrate this, we consider the columns of an image as vectors and compute the Euclidean distance between source A and each of the targets B and C. The mean of these distances ($D_{GP}$) between A and B is 10.9545, while the mean of the distances between A and C is 6.7082. By considering columns or rows of an image as spatial profiles, we can thus capture the complex spatial patterns. With this motivation, we introduce an additional loss that is computed in a way which preserves the spatial structure of the target label map over the entire image, in contrast to over regions or individual pixels. This is achieved by matching prediction probabilities along the horizontal and vertical directions in the output segmentation maps.
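The pixel-wise versus profile-wise comparison above is easy to reproduce numerically. The sketch below uses hypothetical 8×8 binary images (not the exact images of Fig. 6, so the printed values differ from those quoted above) to show that the mean absolute difference is blind to spatial structure while column-profile distances are not:

```python
import numpy as np

# Hypothetical 8x8 binary images (not the exact ones in Fig. 6).
# Targets B and C contain the SAME number of white pixels (16 each),
# but B concentrates them in two columns while C spreads them out.
A = np.zeros((8, 8))                 # source: all black
B = np.zeros((8, 8)); B[:, :2] = 1   # 16 white pixels, concentrated in 2 columns
C = np.zeros((8, 8)); C[:2, :] = 1   # 16 white pixels, spread over all columns

def d_pixel(x, y):
    """Mean absolute difference, treating every pixel independently."""
    return np.abs(x - y).mean()

def d_profile(x, y):
    """Mean Euclidean distance between corresponding column profiles."""
    return np.linalg.norm(x - y, axis=0).mean()

print(d_pixel(A, B), d_pixel(A, C))      # 0.25 0.25   -> identical, structure ignored
print(d_profile(A, B), d_profile(A, C))  # ~0.71 ~1.41 -> profiles tell them apart
```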
The whole row or column (i.e., the profile) of the output prediction map is considered as a vector and matched in vector space by computing cosine similarity. This is inspired by the recently proposed spatial profile loss (SPL) [36] for image translation tasks. SPL computes such similarities on different color spaces and gradient spaces of the image. Our contribution is to incorporate such a matching on prediction probabilities in a semantic segmentation task. Since we are matching probability distributions along profiles, we compute this similarity over the gradients of the prediction class maps. Formally, the similarity over each image channel is measured as follows:

$$\mathcal{L}_{profile}(y, \hat{y}) = -\frac{1}{H}\,\mathrm{tr}\!\left(\frac{y}{\lVert y \rVert}\left(\frac{\hat{y}}{\lVert \hat{y} \rVert}\right)^{\tau}\right) - \frac{1}{W}\,\mathrm{tr}\!\left(\left(\frac{y_c}{\lVert y_c \rVert}\right)^{\tau}\frac{\hat{y}_c}{\lVert \hat{y}_c \rVert}\right), \tag{8}$$

where $y$ represents the ground truth mask of size $H \times W$, $\hat{y}$ represents the predicted class labels of the same dimension, $\mathrm{tr}(\cdot)$ represents the trace of a matrix, $(\cdot)^{\tau}$ represents the transpose of a matrix, the subscript $c$ indicates a column of the matrix, and the norms are taken row-wise and column-wise, respectively. The first and second terms compute the similarity between the row and column profiles of the ground truth mask and the predicted class labels, respectively. We compute the loss given in eq. (8) in the image gradients' space, and call it the gradient profile (GP) loss [36]:

$$\mathcal{L}_{GP}(y, \hat{y}) = \mathcal{L}_{profile}(\nabla y, \nabla \hat{y}).$$

The image gradients for each channel of an image can be easily computed by taking the difference between an image and its 1-pixel shifted version.
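A minimal PyTorch sketch of the GP loss is given below. It follows our reading of eq. (8) and [36]; the function names and the reduction over batch and channel dimensions are our own choices, not code from the cited work:

```python
import torch
import torch.nn.functional as F

def profile_similarity(y, y_hat, eps=1e-8):
    """Cosine similarity of row profiles and column profiles, averaged.
    y, y_hat: tensors of shape (N, C, H, W) holding per-class probability maps."""
    row_sim = F.cosine_similarity(y, y_hat, dim=-1, eps=eps).mean()  # match rows
    col_sim = F.cosine_similarity(y, y_hat, dim=-2, eps=eps).mean()  # match columns
    return row_sim + col_sim

def gradients(x):
    """Horizontal and vertical gradients via 1-pixel shifted differences."""
    return x[..., :, 1:] - x[..., :, :-1], x[..., 1:, :] - x[..., :-1, :]

def gp_loss(y, y_hat):
    """Gradient profile loss: negative profile similarity in gradient space."""
    (gx, gy), (gx_hat, gy_hat) = gradients(y), gradients(y_hat)
    return -(profile_similarity(gx, gx_hat) + profile_similarity(gy, gy_hat))

# Example usage with random one-hot ground truth and softmax predictions
y = F.one_hot(torch.randint(0, 5, (2, 320, 320)), 5).permute(0, 3, 1, 2).float()
y_hat = torch.softmax(torch.randn(2, 5, 320, 320), dim=1)
print(gp_loss(y, y_hat))
```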

Results
The training of the U-Net CNN is conducted with different backbones; in particular, we have used ResNet and EfficientNetV2 variants as feature extractors. We evaluate the performance of the classification in terms of the intersection over union (IoU). The results are compared against those from the earlier work [28], as reproduced in Table 1; our results are summarized in Table 2. When just cross entropy loss was used in [28] with a resnet101 backbone, the mIoU achieved was merely 64.97% (row 1, Table 1). If we use a combination of categorical focal and Jaccard loss, the mIoU score jumps to 76.52% (row 3, Table 2). Moreover, even resnet50, with 19 million fewer trainable parameters compared to resnet101, performs better with this combination (row 1, Table 2). Remarkably, the addition of GP loss further improves the classification performance in terms of both mIoU and F1 scores, for individual classes (notably oil spill, which is an imbalanced class) as well as for the overall performance.

Table 1. Comparison of classification results with the state of the art (as reported by the earlier work [28]), assessed over the test SAR images in terms of the intersection over union (IoU) score.
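As a sketch of how the combined objective can be assembled (an illustrative composition: the equal weighting of the three terms is our assumption, and `gp_loss` refers to the helper sketched in the previous listing):

```python
import torch
import torch.nn.functional as F

def focal_loss(y, logits, alpha=0.25, gamma=2.0):
    """Categorical focal loss: CE scaled by the modulating factor (1 - p_t)^gamma."""
    ce = F.cross_entropy(logits, y, reduction="none")  # -log(p_t) per pixel
    p_t = torch.exp(-ce)
    return (alpha * (1 - p_t) ** gamma * ce).mean()

def jaccard_loss(y_onehot, probs, eps=1e-6):
    """Soft Jaccard (IoU) loss averaged over all classes."""
    inter = (y_onehot * probs).sum(dim=(0, 2, 3))
    union = (y_onehot + probs - y_onehot * probs).sum(dim=(0, 2, 3))
    return (1 - (inter + eps) / (union + eps)).mean()

def total_loss(y, logits):
    """Combination of focal, Jaccard and GP losses for training."""
    probs = torch.softmax(logits, dim=1)
    y_onehot = F.one_hot(y, logits.shape[1]).permute(0, 3, 1, 2).float()
    return (focal_loss(y, logits)
            + jaccard_loss(y_onehot, probs)
            + gp_loss(y_onehot, probs))  # gp_loss as sketched earlier

# Example: 5-class logits for a batch of two 320x320 SAR patches
y = torch.randint(0, 5, (2, 320, 320))
logits = torch.randn(2, 5, 320, 320, requires_grad=True)
total_loss(y, logits).backward()
```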

Conclusion
It is noteworthy that the deep learning here has been performed on a rather small training set with a large class imbalance. It is probable that a larger dataset would help in furthering the scores, though decent results (with F1 > 80) are achieved already. We thank the researchers who set up this dataset [28,30]; for our future work, we aim to further improve upon these results.