2.1. Initial Data
For this research, we adopt the data collection methodology outlined in [
3]. Our dataset comprises samples obtained during four research expeditions conducted in the Atlantic and Arctic oceans. These expeditions were undertaken by the Shirshov Institute of Oceanology of the Russian Academy of Sciences within the governmental program of regular ocean observations.
The routes of the expeditions encompass points with SeaVision radar images and/or Spotter buoys measurements [
8]. Comprehensive details about the research expeditions can be found in
Table 1. The selection of locations for sea wave observation was based on local weather conditions and temporal constraints. As
Table 1 illustrates, there are points without available Spotter buoy data. Consequently, for further study, we only utilize specific locations where simultaneous wind wave Spotter observations and SeaVision images were conducted. We denote such selected points by <<stations>>.
Spotter wave buoy provides highly accurate measurements of wind wave characteristics collecting the training dataset of SWH. SeaVision radar creates one sea clutter image every two seconds.
The center of the resulting sea clutter images coincides with the location of the radar antenna. Raw images with the spatial resolution of 1.875 m cover >7 km radius around the ship. The image exhibits a signal with diverse structures and intensities attributed to local wind direction, ship rotation, and reflections of electromagnetic signals from the rough ocean surface (e.g., [
9]).
We suppose that each radar image includes an area with the most meaningful signal information about wave field. Additionally, there is a "blind" zone near the center due to signal reflection from the ship. To mitigate this effect, we exclude the part of the image within a 300-meter radius around the ship. Given that the reflected signal becomes less distinct with increasing distance from the radar antenna, we assume that a radius of ≈2000 meters ensures the highest significant variability observed in radar data. Consequently, first, we restrict our analysis to the 300–2000 meter range for further processing.
The important step of the data pre-processing is a choice of the optimal 180° sector containing the most informative wave signal. The technique used in this research extracts the most contrasting area from the entire radar image to provide the most clearly identified wind waves. For every station, the sector with the largest temporal standard deviation is the optimal sector. In this study, considering spatial resolution of 1.875 m, we choose the outer radius of the meaningful area as ≈1920 m (1024 px). Hence, we work with pre-processed
images in contradistinction to
in [
3].
We linearly transform the values of the back-scattered radar signal image so that for every specific image, the minimum value is 0 and the maximum is 255. For further purposes, we use only masked real
images. Namely, we replace the pixels outside 300–1920 meter radius area with zeros. The example of the pre-processed radar image is shown in
Figure 1.
2.2. Synthesis of Realistic Sea Surface
We develop the methodology of generating realistic synthetic radar images [
5] based on realistic ocean scenes [
6]. The authors of [
6] elaborated a technique that presents fully developed sea with an empirical modified Pierson-Moskowitz sea power spectrum [
10].
As proposed in [
6], we first generate a
array with noise values uniformly distributed from
to
. We choose the image size that is bigger than 2048 to further exclude possible edge effects. The result of this step is a white-noise image (see
Figure 2a).
We also assume that the spatial resolution is 1.875 m, equal to that of the real radar images. This parameter allows us to perform a two-dimensional forward Fourier transform of the white-noise image to generate an array of complex numbers. The magnitude of these complex Fourier components is shown in
Figure 2b. The coordinate axes in
Figure 2b are spatial frequencies
and
[
].
In this research, we consider a simple case of fully developed wind waves, characterized by a wave power spectrum constant in time. Under these assumption, W. Pierson and L. Moskowitz [
10] empirically approximated the mathematical form of the downwind power spectrum
:
where
f is the temporal frequency [Hz],
is the peak temporal frequency [Hz],
is the Phillips constant, and
g is the mean gravitational acceleration [m/
].
The form of the fully-developed wind spectrum (
1) depends solely on a single parameter
that indicates the frequency of the spectrum maximum. It was also discovered in [
10] that
is a function of 10 m surface wind speed
:
SWH is usually defined as an average measurement of the largest 33% of waves [
11]:
, where
N is the total number of measured waves,
is a height of the
j-th wave from the largest 33%. For a given power spectrum
, we can equivalently calculate significant wave height [
12]:
. Thus, from (
2), for Pierson-Moskowitz spectrum (
1), we obtain:
Hence, we completely determine the shape of the one-dimensional downwind fully developed wave spectrum (
1) with either
or SWH (
3).
The two-dimensional wave power spectrum
was proposed in [
13] as an extension of (
1) taking the wind direction into account:
where
is one-dimensional Pierson-Moskowitz spectrum from (
1) and
is a normalized directional multiplier at angle
from the downwind direction. The empirical parameters in (
5) are
and
that is equal to 4.06 for
and –2.34 for
.
The example of the normalized filter
for
=15 m/s for the frequency domain of
Figure 2b is shown in
Figure 3a. To transform the white-noise spectrum into the two-dimensional Pierson-Moskowitz spectrum (
4), we multiply the magnitudes of the Fourier components of the initial white-noise image by (
1) and (
5).
The resulting spectrum creates a narrower profile near
(
2) in the downwind direction, forms a bimodal spectrum shape for
from the downwind direction, and suppresses the long-crested peak frequency components, while retaining non-peak frequencies [
6]. The filtered magnitudes of the white-noise Fourier components from
Figure 2b are shown in
Figure 3b.
Subsequently, we combine the filtered magnitudes with the original phase of the white-noise complex Fourier components, and apply the inverse Fourier transform, as described in [
6]. The result of this procedure is a
array of complex numbers. The desired synthetic sea surface is ar eal part of this array. We illustrate the full-domain realistic sea surface and its central part in
Figure 4.
2.3. Synthesis of Radar Images
Here we describe transformation of the synthetic sea surfaces obtained in Sub
Section 2.2 into synthetic X-band radar images.
Two main geometrical effects that influence the X-band signal reflected from the sea surface are (1) shadowing and (2) tilt modulation [
5]. We further follow the transformation technique proposed by [
14] with the subsequent modifications [
5].
Briefly, in this study, the shadowing effect refers to the geometrical optics approximation. In certain areas, the sea surface obstructs the reflection of radar rays from adjacent areas, leading to the shadowing of nearby waves. Consequently, the radar antenna receives no meaningful signal from the shadowed parts of the sea surface [
5]. Obviously, this phenomenon depends on the grazing angle that is determined by the relationship between the radar antenna height
and the distance to the antenna
x (
Figure 5).
For tilt modulation, the steepness of the observed surface slope affects the power amplitude received by the radar antenna [
5]. Thus, for non-shadowed areas the received back-scattered signal is proportional to
, where
is the angle between the radar ray
and the vector
normal to the wave surface (
Figure 5).
Summarizing, we present the following algorithm that transforms the synthetic sea surface into the realistic X-band radar image, For every synthetic sea surface:
Distance mask: we determine the masked points – the points with the distance to the radar more than 1920 m or less than 300 m;
Shadowing: we determine the points that are not available for a radar ray;
Tilt modulation: for every non-shadowed point we determine the angle between the radar ray and the vector normal to the wave surface;
For every non-shadowed point the amplitude of the back-scattered signal is ;
We compute the minimum and the maximum values of the back-scattered amplitude among the non-shadowed points;
Normalization: we linearly transform the amplitude values so that the new minimum value is 0 and the maximum value is 255;
Noise: for shadowed and masked points we set the amplitude values as random numbers distributed uniformly from 0 to 255.
We choose the half of the array that corresponds to the downwind direction.
The resulting synthetic X-band radar image with
size is shown in
Figure 6.
2.5. Data Pre-Processing
In this research, we apply convolutional neural networks (CNNs). Consistent with standard practice, we linearly normalized both input data and the target SWH values, adjusting them to approximately have a zero mean and a variance equal to one. Namely, for every pixel
x of the real and synthetic radar images, the normalized radar image pixel
. Nevertheless, after normalization, for the masked areas of the image (white areas in
Figure 6a), we set the zero value. The normalized target value is
.
CNNs lack inherent rotation invariance, necessitating diverse training data with varied spatial feature orientations to achieve rotation invariance. In this study, we assume that the orientation of spatial features may not exhibit high diversity. Consequently, we employ two-dimensional data augmentation fir the real SeaVision and synthetic radar semicircles, namely, random rotation with an angle ranging from -5° to 5°. This approach encourages the CNN-based model to acquire rotation invariance through training on augmented data, thereby enhancing the generalization ability of the CNN.
2.6. ANN Models
In this study, we apply convolutional neural networks (CNNs), i.e., parametric mappings where the model parameters are optimized through sequential application of a fixed-size convolutional kernel to two-dimensional input data [
15].
The high depth of CNNs is anticipated to enhance the predictive output quality. However, the large number of layers introduces training instability in the back-propagation algorithm, leading to learning inefficiency due to the <<vanishing gradients>> effect. This effect results from the accumulation of excessively small gradients for model parameters. Consequently, the product of the gradient vector and the learning rate coefficient tends toward zero, causing the parameters to remain constant during each optimization step.
An effective approach to address the issue of learning instability involves the incorporation of connections that bypass the intermediate layers of the model. These skip connections serve to diminish the likelihood of accumulating small gradients. A notable example is the implementation of residual connections, wherein the output of an intermediate layer is added to the output of a subsequent level.
In this paper, we build two CNN architectures based on the model for processing SeaVision radar images as it is described in [
16]. The basic architecture is the modified ResNet50 combining the advantages of deep CNNs and residual connections.
The convolutional core of the modification adheres to the original ResNet architecture, with the addition of sinusoidal positional encoding of various wavelengths to enable the CNN to capture the wavelengths peculiar to real SeaVision images [
16]. Two-dimensional positional encoding introduces additional channels with generated harmonic maps featuring various wavelengths and directions. Specifically, there are cosine- and sine-based positional encoding channels that vary in both horizontal and vertical directions. These maps are concatenated with the output of the ResNet blocks so that the subsequent ResNet block processes activation maps from the previous layers along with the positional encoding channels. It is worth noting that for modified ResNet50, positional encoding maps are injected into the activation maps after each ResNet building block.
2.6.1. Unsupervised Reconstruction Model
The first architecture elaborated for this research is the reconstruction CNN model with the structure similar to U-Net [
17]. U-Net is based on a typical contracting network augmented with consecutive upsampling blocks. Hence, U-Net architecture consists of two parts: the contracting path from the input to the bottleneck, and the expansive path from the bottleneck to the output. The upsampling blocks of the model contain a large number of feature channels to effectively propagate context information to higher resolution layers.
The main distinction and the principal concept of U-Net is the presence of the specific skip connections that transmit activation maps from the intermediate layers of the contracting path to the intermediate layers of the expansive path. In other words, the high-resolution features from the contracting path are added to the upsampled output on the expansive path to facilitate localization. Subsequently, a consecutive convolution layer is capable of learning to generate a more precise output based on this information [
17].
For this research, we develop a U-Net architecture with the contracting path based on the convolutional core of modified ResNet50 (the upper half of
Figure 9). The input of the model is supposed to be a two-dimensional array with
size. Four skip connections and a bottleneck pass the multilevel features into the expansive path of the model (the lower half of
Figure 9). It leads to the output with the input size. With the equal size of the input and the output, the model architecture is aimed at unsupervised reconstruction of the input array.
As
Figure 9 shows, the expansive part of the model consists of ResNet pr upsampling ResNet blocks with the additional upsampling layers if necessary. On the expansive path, we coordinate the spatial size of the block outputs with the outputs of the skip connections to properly concatenate the tensors before the subsequent ResNet block or layer. The number of channels for the concatenated activation maps and the output size of the skip connections are shown in
Figure 9.
2.6.2. Regression Model
The second architecture built for this study is the regression CNN model designed to approximate the scalar target SWH value through processing
radar images. Unlike the regression model from [
16] with the same input and output objects, we base our model on the reconstruction architecture described in
Section 2.6.1 including the skip connections (see
Figure 10).
As
Figure 10 shows, our regression model consists of four fundamental parts. The first part is the convolutional core of modified ResNet50. We preserve four contracting paths to pass the outputs of the intermediate layers to the downsampling blocks. Every parallel downsampling block consists of a pooling layer, a convolutional layer, a pixel unshuffling layer and two consecutive convolutions. The task of the downscaling part of the model is to bring four activation maps of the skip connections to the same size. In the third part of the architecture, the features from the previous block are aggregated through channel concatenation. The aggregated activation map is passed to the downsampling block resulting in vector of size 64. The fully-connected subnet following the aggregation part contains two sequential fully-connected layers of the widths 64 and 8. The terminating layer is of the width 1 since in this study, we approximate SWH scalar.
2.7. Quality Metrics
Various quality metrics are employed to assess and monitor the learning process of the models. For the unsupervised reconstruction model, the simplest quality metric utilized is the root mean-squared error, denoted as . This form of reconstruction error quantifies the disparity between the input data and the data obtained after the U-Net compresses and decompresses the input. The metric provides an evaluation of the noise introduced to the input as it passes through the bottleneck and the skip connections.
If we denote our reconstruction model as
R and a batch of input images as
y, then an output batch is
. We compute
summarizing only by unmasked points to exclude non-meaningful areas of radar images. The number of points in the input batch is
N, and the result is:
In (
6),
and
are the identical elements of the input and the output tensors. We summarize the difference between the input and the output point-by-point. The indices
i,
j and
k are in ranges of respective spatial dimensions and a batch size, taking into account distance mask.
The inadequacy of RMSE in detecting visually altered defective regions in images with consistent intensity values has been demonstrated [
18]. In this research, we adopt a perceptual quality metric grounded in structural similarity, which assesses the inter-dependencies among local image regions. Unlike RMSE, which compares pixel values, structural similarity takes into account contrast and structural information. The computation of the structural similarity index measure (SSIM) is outlined as follows:
where
is the pixel sample mean of
y,
is the pixel sample mean of
,
is the variance of
y,
is the variance of
,
is the cross-correlation of
y and
.
and
are constant values equal to
and
, respectively.
In this paper, we compute the Structural Similarity Index (SSIM) between two windows, each sized 128 px × 128 px, applied to both the input and the output images. The SSIM values range between and 1, where a higher SSIM signifies greater similarity.
In case of the regression model, only root mean-squared error
is utilized since the output of this CNN is a vector containing significant wave height values with a length equal to the batch size
:
where
is a target value measured by Spotter buoy (see
Section 2.1) or determined by the synthetic ocean surface (see
Section 2.2), and
is the output value of the regression model.
2.8. Training and Evaluation
The training process of artificial neural networks is known for its sensitivity to various details, and the selection of a training algorithm and hyperparameters plays a critical role in determining the quality of the resulting model. Currently, the Adam optimizer [
19] stands out as the most stable and widely employed training algorithm, utilizing a momentum approach to estimate lower-order moments of the loss function gradients. In our study, we leverage the Adam optimization procedure.
In the optimization algorithms, the batch size and the learning rate hold particular significance. Due to the substantial dimensions of SeaVision radar images, substantial variations in batch size were impractical. Instead, we selected the largest feasible batch size for our computer hardware (batch_size=2) to mitigate noise in CNN gradient estimates. Adhering to best practices, we fine-tuned the learning rate schedule not only to attain high quality in SWH regression but also to enhance robust generalization skills. Generalization is evaluated by scrutinizing the disparity between the quality estimated on the training and validation subsets, where a small gap signifies good generalization and a large gap indicates poor generalization.
Aligned with recent research findings, we implemented the specialized learning rate schedule. This cyclical schedule encompasses a cosine-shaped decrease in the learning rate throughout the training process. We incorporate a multiplicative form of increase in the simulated annealing period with each cosine cycle. We also employ exponential decay of the simulated annealing magnitude with each cosine cycle, utilizing the multiplicative form.
In this research, training consists of three principal consecutive steps. For all the steps, we use normalized values as it was described in
Section 2.3.
On step One, we train the reconstruction model using the dataset of synthetic images described in
Section 2.4. The dataset comprising 60,000 synthetic radar images is randomly divided into two segments: a training dataset (50,000 images) and a validation dataset (10,000 images).
In machine learning, it is customary to assess the performance of a model by computing quality metrics on a validation subset obtained through random sampling from the original set of labeled examples. This methodology presupposes that the examples are independent and identically distributed (i.i.d.). The random partition ensures a uniform distribution of modeled surface wind wave conditions across the datasets. Moreover, the distribution of SWH values are similar for both datasets.
We train the reconstruction model using the mean-squared error between an input and an output images summing only over unmasked points. The length of training is 80 epochs. The learning rate schedule for the reconstruction model is in
Figure 11. We change the first cosine cycle from
learning_rate to
learning_rate.
On step Two, we train the regression model using the same dataset of synthetic images described in
Section 2.4. The number of images for training and evaluation is the same as for step One. Nevertheless, we change the random split of the dataset to prevent overfitting.
Step Two consists of two stages: and . For both stages of step Two, the input of the model is the masked radar image. We train the regression model using the mean-squared error between an output value and a target SWH.
For stage
, we initialize the weights of the convolutional modified ResNet50 core and the skip connections with the weights from the pre-trained reconstruction model as described for step One. After that, we train only downsampling block, feature aggregation and fully-connected layers. The covolutional core weights are frozen. After 20 epochs, we start stage
, training the full regression model. Epoch 1 of stage
is inintialized with the weights of the full regression model after epoch 20 of stage
. The length of stage
training is 40 epochs. The learning rate schedule for step Two is in
Figure 12. We change the cosine cycles from
learning_rate to
learning_rate.
In investigations concerning the application of statistical models, particularly ANNs, for the analysis of remote sensing data, it is crucial to account for the autocorrelation inherent in the observational time series dataset. Owing to the inherent evolution of underlying physical phenomena, consecutive observations may demonstrate substantial autocorrelation, thereby influencing the accuracy of the model [
16].
On step Three, we train the regression model using the dataset of real SeaVision synthetic images described in
Section 2.1.
It is imperative to refrain from systematically incorporating consecutive examples into the training and testing sets. In this research, we tackle the challenge of strongly correlated successive examples by adopting station-wise random sampling, a strategy akin to the one employed in [
16]. This approach avoids the systematic sampling of successive examples into the training and validation subsets, ensuring a reliable assessment of our model’s quality.
Step Two consists of three stages: , and . The input of the model is the masked radar image. We train the regression model using the mean-squared error between an output value and a target Spotter buoy SWH.
For stage
, we initialize the weights of the full model with the weights from the pre-trained regression model after stage
. Then, we train only fully-connected block for 10 epochs. After that, on stage
we start training fully-connected block, feature aggregation and downsampling block while the covolutional core weights are frozen. After 40 epochs of stage
, we start stage
, training the full regression model. The length of stage
training is 20 epochs. The learning rate schedule for step Two is in
Figure 13. We change the cosine cycles from
learning_rate to
learning_rate for stages
and
, and from
learning_rate to
learning_rate for stage
.