Preprint
Article

This version is not peer-reviewed.

Phase Congruency-Guided Cross-Scale Contextual Fusion Network for Salient Object Detection in Optical Remote Sensing Images

Submitted: 26 March 2026
Posted: 27 March 2026


Abstract
In recent years, salient object detection in optical remote sensing images (ORSI-SOD) has garnered increasing research attention. However, in practical applications, issues such as blurred target edges under low contrast and complex background interference continue to restrict the accuracy and robustness of detection. To address these problems, this paper proposes the Phase Congruency-Guided Cross-Scale Contextual Fusion Network (PCFNet). Specifically, we design a novel Phase Congruency Enhanced (PCE) module to address the low contrast between targets and backgrounds. It acquires multi-scale phase features via Fourier decomposition, fuses them with shallow Transformer features, and applies a tailored loss weighting mechanism that emphasizes phase-congruent regions during supervision, improving the adaptation of the PCE module. To counter complex background interference, we design a novel Dynamic Residual Fusion (DRF) module. It leverages dynamic spatial attention and residual connections to refine multi-scale features, enabling the model to accurately capture effective target features under complex background interference. Experiments on the ORSSD, EORSSD, and ORSI4199 benchmarks show that PCFNet outperforms 23 state-of-the-art methods in core metrics, and ablation studies further confirm the effectiveness of each module.

1. Introduction

Salient object detection (SOD) is a fundamental task in computer vision that mimics human visual attention mechanisms to automatically identify and segment the most visually prominent regions in images [1]. In recent years, advances in deep learning have fueled substantial progress in SOD for natural scene images (NSI-SOD), with successful applications in image retrieval, object recognition, visual tracking, and various other vision-related tasks [2,3,4,5]. At the same time, SOD for optical remote sensing images (ORSI-SOD) has garnered growing interest. It delivers essential prior information for downstream tasks such as environmental change monitoring, aviation navigation, underwater detection, and urban planning [6,7,8]. Unlike natural scene images (NSI) captured by handheld cameras, optical remote sensing images (ORSI) are color imagery obtained by satellite or aerial sensors (with a wavelength range of 400–760 nm), leading to inherent distinctions between NSI and ORSI. Specifically, NSI typically exhibit fixed viewing angles, consistent object scales, and relatively simple backgrounds, which facilitate high foreground–background contrast and clear object boundaries. In contrast, ORSI often involve diverse imaging perspectives, large variations in object scale, and heterogeneous land cover compositions. These characteristics commonly lead to low-contrast edges and severe background clutter. Such differences cause mature NSI-SOD methods to perform poorly when directly applied to ORSI-SOD scenarios, highlighting an urgent need for specialized approaches that fully adapt to the unique characteristics of ORSI and ensure reliable and accurate salient object detection.
In the field of ORSI–SOD, a key challenge lies in the low contrast of target objects, often caused by uneven illumination, atmospheric effects, or long imaging distances. Although CNNs excel at local feature extraction, limited receptive fields and reliance on intensity variations make them struggle to capture sufficient discriminative cues in low contrast regions [9,10]. Transformers capture global contextual dependencies well, but patch level tokenization often blurs fine edge details, especially in low contrast scenarios [11,12].
To better recover edge details in low contrast scenes, some recent works explore frequency domain representations to enhance structural cues. However, these methods mainly rely on magnitude spectra and pay limited attention to phase information [13,14,15]. Phase Congruency (PC) measures the agreement of phase angles across multiple frequencies, which makes it naturally insensitive to intensity contrast while remaining highly sensitive to edges and textures. Motivated by this property, we propose a novel Phase Congruency Enhanced Module (PCE). It injects PC cues into a Transformer backbone to strengthen target feature representations in low contrast scenes. To further amplify the effect of enhanced phase features on key regions [16], we also design a loss weighting mechanism (pc_weight) for subsequent loss supervision. It accurately strengthens the feature expression of low–contrast targets and significantly improves detection accuracy.
Another challenge arises from severe background interference. ORSI backgrounds often contain abundant target-irrelevant components such as clouds, vegetation, water bodies, and terrain shadows. These elements exhibit visual similarity to salient objects and are therefore easily confused with true targets, causing significant interference in accurate recognition and segmentation. Inspired by the idea of "residual attention guidance" in image reconstruction and cross-attention dynamic fusion in autonomous driving [17,18,19], we design a novel Dynamic Residual Fusion (DRF) module to address complex background interference in ORSI-SOD. Its core lies in cross-scale feature integration. The module first fuses shallow and deep features, then uses dynamic spatial attention and channel attention to purify the fused features, highlighting task-relevant regions and suppressing redundant background. In addition, residual connections are used during fusion to preserve fine-grained structural details and prevent weak edges from being lost under complex interference. As a result, DRF performs both feature fusion and feature purification, enabling robust foreground segmentation under severe background interference.
The main contributions are as follows:
  • We propose a novel end-to-end network, PCFNet. It integrates frequency-domain phase enhancement with dynamic cross-scale feature refinement to reduce blurred target boundaries under low contrast and improve robustness under complex background interference.
  • We design a novel PCE module based on Fourier decomposition theory. It fuses multi-scale phase features with shallow Transformer features to resolve blurred target perception caused by low contrast and compensate for local detail loss.
  • We design a novel DRF module. It integrates dynamic spatial attention and residual connections to achieve complementary fusion and effective selection of multi-scale features, suppressing complex background interference while preserving salient structures.
  • We conduct extensive experiments on three benchmark datasets (ORSSD, EORSSD, ORSI4199). Through quantitative comparison with 23 SOTA methods, ablation experiments, and qualitative analysis of complex scenes, the effectiveness and robustness of the proposed network and core modules are fully demonstrated.

3. Proposed Method

In this section, we introduce the proposed PCFNet in detail. Section 3.1 presents an overview of the general framework of PCFNet. Section 3.2 outlines the Transformer-based feature extractor. Sections 3.3 and 3.4 describe the core components PCE and DRF in detail, respectively. Section 3.5 introduces the decoder selected for the model. Finally, Section 3.6 explains the loss function used in the model.

3.1. Framework Overview

As shown in Figure 1, the proposed PCFNet adopts an encoder-decoder architecture. The input image passes sequentially through four Swin-B blocks to extract features at four scales. At the shallowest level, the PCE module compensates for target detail loss caused by low contrast through phase congruency computation. Subsequently, deeper-level features are refined by the DRF module, which fuses and optimizes multi-scale features to suppress background interference while preserving salient structures. Finally, the hierarchical features are aggregated by the SGAED decoder to produce a high-resolution saliency map.

3.2. Swin Transformer-based Feature Extractor

Following existing studies, we select the classical Swin-B as the basic feature extractor [23,25]. Its hierarchical representation and window-based local self-attention are well suited to the dense prediction requirements of ORSI-SOD. Different from the original Swin-B architecture designed for general vision tasks, this paper performs adaptive optimization targeting the characteristics of remote sensing images. We retain the four-level feature extraction structure of the backbone network and adjust the channel dimensions of the deep layers to balance semantic representation capability and computational cost. The Swin-B in Figure 1 corresponds to the Transformer-based encoder subnetwork, which consists of four feature extraction blocks. We take the output of each block as the side-output feature map, defined as $f_t^i \in \mathbb{R}^{C_i \times H_i \times W_i}$ ($i \in \{1, 2, 3, 4\}$). For the input image, the channel numbers $C_{1,2,3,4}$ of the feature maps are $\{128, 256, 512, 1024\}$, and the resolutions $H_{1,2,3,4} = W_{1,2,3,4}$ correspond to $\{112, 56, 28, 14\}$, respectively.

3.3. Phase Congruency Enhanced Module

ORSI often suffer from edge blurring and false target detection in low-contrast scenes, which impairs the discriminability of salient targets. To address this issue, we design a Phase Congruency Enhanced (PCE) Module (as shown in Figure 2), which is deployed after the first-stage Swin-B feature extractor (operating on the feature map $f_t^1$). Through the synergy of frequency-domain phase feature analysis and spatial-domain feature enhancement, the PCE module strengthens fine-grained local features of remote sensing targets, providing high-quality inputs for subsequent cross-scale feature fusion.
1)A: The Wave Filter Module serves as the frequency-domain response extraction unit of the PCE module, responsible for capturing multi-scale and multi-directional edge information of targets. First, the first-stage feature $f_t^1$ output by Swin-B is compressed to 32 channels via a $1 \times 1$ convolution to reduce computational complexity:
$$f_{in}^1 = \mathrm{Conv}_{1\times1}(f_t^1)$$
Subsequently, $f_{in}^1$ is fed into the Wave Filter Module, where it is successively processed by Difference of Gaussians (DoG) filtering and multi-scale, multi-orientation log-Gabor filtering. The module applies the same filtering operation to all 32 channels. For each channel, the log-Gabor filter bank is constructed with 2 scales (wavelengths of 8 and 12 pixels) and 6 evenly distributed orientations ($0^\circ, 30^\circ, 60^\circ, 90^\circ, 120^\circ, 150^\circ$). By this means, a total of 12 frequency-domain response maps ($M$) are generated, which cover edge and texture features of targets at different scales and orientations and provide the basic inputs for subsequent phase analysis.
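As an illustrative sketch (not the paper's implementation), a 2-scale, 6-orientation log-Gabor bank of this kind can be built directly in the frequency domain as follows; the bandwidth parameters `sigma_f` and `sigma_theta` are assumed values, and the DoG pre-filtering step is omitted:

```python
import numpy as np

def log_gabor_bank(h, w, wavelengths=(8, 12), n_orient=6,
                   sigma_f=0.55, sigma_theta=0.4):
    """Frequency-domain log-Gabor filter bank: one radial log-Gaussian
    profile per scale times one angular Gaussian per orientation,
    giving len(wavelengths) * n_orient filters (12 for 2 x 6)."""
    fy = np.fft.fftfreq(h)[:, None]       # vertical frequency grid
    fx = np.fft.fftfreq(w)[None, :]       # horizontal frequency grid
    radius = np.sqrt(fx ** 2 + fy ** 2)
    radius[0, 0] = 1.0                    # avoid log(0) at the DC bin
    theta = np.arctan2(-fy, fx)           # orientation of each frequency

    filters = []
    for wl in wavelengths:
        f0 = 1.0 / wl                     # centre frequency of this scale
        radial = np.exp(-(np.log(radius / f0)) ** 2
                        / (2 * np.log(sigma_f) ** 2))
        radial[0, 0] = 0.0                # zero DC response
        for k in range(n_orient):
            angle = k * np.pi / n_orient  # 0, 30, ..., 150 degrees
            dtheta = np.arctan2(np.sin(theta - angle),
                                np.cos(theta - angle))
            angular = np.exp(-dtheta ** 2 / (2 * sigma_theta ** 2))
            filters.append(radial * angular)
    return np.stack(filters)

def filter_responses(channel, bank):
    """Apply every filter in the bank to one 2-D feature channel via FFT;
    returns complex responses of shape (n_filters, h, w)."""
    F = np.fft.fft2(channel)
    return np.fft.ifft2(F[None] * bank)
```

Applied channel by channel to the 32-channel $f_{in}^1$, this yields the 12 response maps per channel described above.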
2)B: The Phase Congruency Calculation module is the core functional unit of the PCE module. It builds on the visual cognition law that image features concentrate at positions where the phase superposition of Fourier harmonics is maximal [34], so this property can be exploited to localize target features accurately. After feeding $M$ into this module, we first perform scale-wise averaging: for each of the six orientations, we average the two scales with equal weights of 0.5 to obtain the refined $M'$. Here, each of the six waves corresponds to one of the 6 orientations (denoted by $\theta$, where $\theta = 1, 2, \ldots, 6$). Then, we decompose each response map of the refined six waves into amplitude components $A_{\theta,c}$ ($c$ indexes the channels, $c = 1, 2, \ldots, 32$) and phase components $P_{\theta,c}$ via the Fast Fourier Transform (FFT) for each channel. We calculate the amplitude-weighted mean phase $\bar{P}_c$ of the phase components:

$$\bar{P}_c = \frac{\sum_{\theta=1}^{6} A_{\theta,c}\, P_{\theta,c}}{\sum_{\theta=1}^{6} A_{\theta,c} + \varepsilon}$$

where $\varepsilon$ is a small constant to avoid division by zero. Then, in the "calculate" step, we compute the phase deviation and phase congruency via Eqs. (3)-(5), where Eq. (3) is dedicated to phase deviation calculation and Eqs. (4) and (5) to phase congruency evaluation.

$$\Delta\phi_{\theta,c} = \cos\left(P_{\theta,c} - \bar{P}_c\right) - \left|\sin\left(P_{\theta,c} - \bar{P}_c\right)\right|$$

$$\mathrm{PC}_c = \frac{\sum_{\theta=1}^{6} \max\left(A_{\theta,c}\left(\Delta\phi_{\theta,c} - T_c\right),\, 0\right)}{\sum_{\theta=1}^{6} A_{\theta,c} + \varepsilon}$$

$$\mathrm{PC} = \mathrm{Concat}\left(\mathrm{PC}_1, \mathrm{PC}_2, \ldots, \mathrm{PC}_{32}\right)$$

where $T_c = 0.5 \cdot \mathrm{Avg}_{3\times3}(\bar{A}_c)$ denotes the local amplitude mean threshold. Here, $\bar{A}_c$ is obtained by averaging the amplitude spectra over the orientation dimension ($\bar{A}_c = \frac{1}{6}\sum_{\theta=1}^{6} A_{\theta,c}$), $\mathrm{Avg}_{3\times3}(\cdot)$ represents $3 \times 3$ average pooling, and $\max(\cdot, 0)$ filters out invalid phase contributions. Notably, the above operations are performed independently for each of the 32 channels of $f_{in}^1$. Thus, the final dimension of $\mathrm{PC}$ is $32 \times 112 \times 112$, with a value range of $[0, 1]$; a larger value indicates a higher probability that the position belongs to a target feature.
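The per-channel computation above can be prototyped as a minimal NumPy sketch under the paper's stated settings (six orientations, threshold factor 0.5, 3x3 average pooling); the FFT decomposition is abstracted away by taking precomputed amplitude and phase arrays as input:

```python
import numpy as np

def phase_congruency(amps, phases, eps=1e-8):
    """Phase congruency for ONE channel.

    amps, phases: arrays of shape (6, H, W) holding the amplitude and
    phase of the six orientation responses. Returns a (H, W) map in [0, 1].
    """
    # amplitude-weighted mean phase
    p_bar = (amps * phases).sum(0) / (amps.sum(0) + eps)
    # phase deviation: large where phases agree with the mean phase
    dphi = np.cos(phases - p_bar) - np.abs(np.sin(phases - p_bar))
    # local amplitude mean threshold: 0.5 * 3x3 average of the
    # orientation-mean amplitude (sliding window via padding)
    a_bar = amps.mean(0)
    pad = np.pad(a_bar, 1, mode="edge")
    t = 0.5 * np.mean(
        [pad[i:i + a_bar.shape[0], j:j + a_bar.shape[1]]
         for i in range(3) for j in range(3)], axis=0)
    # rectified, amplitude-weighted congruency
    pc = np.maximum(amps * (dphi - t), 0).sum(0) / (amps.sum(0) + eps)
    return pc
```

Running this over all 32 channels and stacking the results would reproduce the $32 \times 112 \times 112$ map described above.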
3)C: The Stripe Suppression Module has the dual function of eliminating stripe artifacts and recovering the channel dimension. As shown in Figure 3, (a) and (c) are feature maps exhibiting stripe artifacts caused by phase congruency computation without the stripe suppression module, while (b) and (d) are the clean feature maps obtained after applying the module, with stripe interference clearly eliminated. After producing the clean feature maps, the module restores the feature channel dimension to be consistent with that of the input feature, laying a solid foundation for the subsequent spatial attention enhancement. The input of this module is the preliminary phase congruency map $\mathrm{PC}$, and the specific processing is as follows:
$$\widetilde{\mathrm{PC}} = \mathrm{PC} \odot \sigma\left(\mathrm{DepthwiseConv}_{7\times7}(\mathrm{PC})\right)$$
$$\mathrm{PC}' = \sigma\left(\mathrm{Conv}_{1\times1}\left(\mathrm{Clamp}\left(\mathrm{LayerNorm}(\widetilde{\mathrm{PC}})\right)\right)\right)$$
where $\sigma(\cdot)$ represents the sigmoid activation function. First, a $7 \times 7$ depthwise convolution is performed on $\mathrm{PC}$ to capture the response of stripe-artifact regions, and a suppression mask is generated by the sigmoid function. The mask is multiplied element-wise with the original $\mathrm{PC}$ to suppress the response of stripe regions. Then, layer normalization standardizes the feature distribution, and value clipping limits the dynamic range of feature values to $[-3, 3]$, completing artifact elimination and feature stabilization. Subsequently, a $1 \times 1$ convolution is applied to the artifact-removed features to restore the channel dimension from 32 to 128. Finally, the sigmoid function maps the feature values to the range $[0, 1]$, generating the phase congruency weight map $\mathrm{PC}'$.
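A minimal PyTorch sketch of this stripe-suppression pipeline follows; the normalization placement and weight initialization are assumptions on our part:

```python
import torch
import torch.nn as nn

class StripeSuppression(nn.Module):
    """Sketch: a 7x7 depthwise conv produces a sigmoid mask that damps
    stripe responses; LayerNorm plus clamping to [-3, 3] stabilises the
    features; a 1x1 conv restores the channel count (32 -> 128)."""

    def __init__(self, in_ch=32, out_ch=128):
        super().__init__()
        self.dw = nn.Conv2d(in_ch, in_ch, 7, padding=3, groups=in_ch)
        self.norm = nn.LayerNorm(in_ch)       # normalises over channels
        self.proj = nn.Conv2d(in_ch, out_ch, 1)

    def forward(self, pc):                    # pc: (B, 32, H, W)
        mask = torch.sigmoid(self.dw(pc))     # stripe-region response
        x = pc * mask                         # suppress stripe regions
        x = self.norm(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
        x = x.clamp(-3, 3)                    # limit dynamic range
        return torch.sigmoid(self.proj(x))    # weight map PC' in [0, 1]
```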
4)D: The Variance-Based Spatial Attention is the output enhancement unit of the PCE module, used to improve the spatial discriminability of features. We use variance because, in ORSI, the boundaries and textured regions of salient objects usually exhibit noticeable gray-level jumps, leading to high local variance: high variance typically reflects the contours or locations of salient objects, while low variance corresponds to smooth regions. After feeding $\mathrm{PC}'$ into this module, we first calculate the mean $\mu(\mathrm{PC}')$ and variance $\sigma(\mathrm{PC}')$ across all channels for each spatial position, concatenate them, and compress the dimension via a $1 \times 1$ convolution to generate spatial attention weights. We then perform element-wise multiplication with $f_t^1$ and add a residual connection to avoid semantic loss. Finally, the enhanced feature $f_p^1$ is output, as shown in Eq. (7):
$$f_p^1 = f_t^1 \odot \sigma\left(\mathrm{Conv}_{1\times1}\left(\left[\mu(\mathrm{PC}'),\, \sigma(\mathrm{PC}')\right]\right)\right) + f_t^1$$
The enhanced feature $f_p^1$, with significantly strengthened target information, is fed into the subsequent DRF module for cross-scale feature fusion.
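Eq. (7) can be sketched as a small PyTorch module; the concatenation order and the use of the (biased-by-default) per-pixel channel variance are assumptions:

```python
import torch
import torch.nn as nn

class VarianceSpatialAttention(nn.Module):
    """Sketch of Eq. (7): per-position mean and variance across the
    channels of PC' are concatenated, compressed to a one-channel
    attention map by a 1x1 conv, and used to re-weight the backbone
    feature f_t^1 with a residual connection."""

    def __init__(self):
        super().__init__()
        self.squeeze = nn.Conv2d(2, 1, 1)  # [mean, var] -> attention logit

    def forward(self, pc, ft1):            # both: (B, C, H, W)
        mu = pc.mean(dim=1, keepdim=True)   # per-pixel channel mean
        var = pc.var(dim=1, keepdim=True)   # per-pixel channel variance
        attn = torch.sigmoid(self.squeeze(torch.cat([mu, var], dim=1)))
        return ft1 * attn + ft1             # enhanced feature f_p^1
```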

3.4. Dynamic Residual Fusion (DRF) Module

The Dynamic Residual Fusion (DRF) Module is designed to alleviate insufficient cross-scale feature fusion and suppress complex background interference. It leverages the residual attention mechanism to gradually achieve sufficient interaction between shallow and deep features in both the channel and spatial dimensions. As shown in Figure 4, the DRF module consists of a feature concatenation unit, Residual Dual Attention Blocks (RABs), and a global residual connection. In the DRF, the feature concatenation unit fuses cross-scale features, the dual attention mechanism filters the fused cross-scale features, and the residual connections preserve detailed information and gradient stability during feature enhancement.
The input of DRF includes two cross-scale features $f_t^{i-1}$ and $f_t^i$. First, $f_t^{i-1}$ (after downsampling) and $f_t^i$ are concatenated along the channel dimension to obtain the feature $f_{cat}$. Then a $1 \times 1$ convolution followed by BN and LeakyReLU is used to compress the channel dimension, yielding the initial feature $f_{init}$.
$$f_{init} = \mathrm{LeakyReLU}\left(\mathrm{BN}\left(\mathrm{Conv}_{1\times1}(f_{cat})\right)\right)$$
Subsequently, $f_{init}$ is fed into a residual attention group composed of 5 stacked RABs. Inspired by the gradient residual attention network [17], which effectively aggregates fine-grained features via stacked residual dense blocks, we choose 5 RABs as the intra-group units, balancing feature enhancement performance, computational overhead, and gradient stability; the validity of this configuration is demonstrated in the subsequent ablation experiments. For each RAB, the input feature is first preprocessed by two $3 \times 3$ convolutions: the first layer is followed by BN and LeakyReLU, while the second layer retains only BN, resulting in the feature $f_{conv2}$.
$$f_{conv1} = \mathrm{LeakyReLU}\left(\mathrm{BN}\left(\mathrm{Conv}_{3\times3}(f_{R\text{-}in})\right)\right)$$
$$f_{conv2} = \mathrm{BN}\left(\mathrm{Conv}_{3\times3}(f_{conv1})\right)$$
Based on $f_{conv2}$, the Channel Attention (CA) module is first used to generate channel weights. Global max pooling (GMP) and global average pooling (GAP) over the spatial dimensions are performed on $f_{conv2}$ to obtain the channel descriptors $f_{gmp}$ and $f_{gap}$. These two descriptors are fed into a shared MLP for dimension reduction and restoration, then summed and activated by a sigmoid to generate the weight $w_{CA}$, which is element-wise multiplied with $f_{conv2}$.
$$f_{gmp} = \mathrm{GMP}(f_{conv2}), \quad f_{gap} = \mathrm{GAP}(f_{conv2})$$
$$w_{CA} = \sigma\left(\mathrm{MLP}(f_{gmp}) + \mathrm{MLP}(f_{gap})\right)$$
$$f_{CA} = f_{conv2} \odot w_{CA} + f_{conv2}$$
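A rough PyTorch sketch of this channel-attention step is shown below; the reduction ratio `r` of the shared MLP is an assumed value not given in the text:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Sketch: max- and average-pooled channel descriptors pass through
    a shared MLP, are summed, sigmoid-activated into per-channel
    weights, and applied with a residual connection."""

    def __init__(self, channels, r=16):
        super().__init__()
        self.mlp = nn.Sequential(            # shared by both descriptors
            nn.Linear(channels, channels // r),
            nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels),
        )

    def forward(self, x):                    # x: (B, C, H, W)
        b, c, _, _ = x.shape
        gmp = x.amax(dim=(2, 3))             # global max pooling  -> (B, C)
        gap = x.mean(dim=(2, 3))             # global average pooling
        w = torch.sigmoid(self.mlp(gmp) + self.mlp(gap)).view(b, c, 1, 1)
        return x * w + x                     # f_CA with residual
```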
Then, the Dynamic Spatial Attention (DSA) module is applied to $f_{CA}$ to achieve spatial feature enhancement. First, global average pooling is performed on the input feature $f_{CA}$, followed by two successive $1 \times 1$ convolutions with a ReLU activation in between; a $3 \times 3$ convolution kernel is then generated through a Reshape operation. The module is dynamic in that the kernel shape remains $3 \times 3$ while its weights are not fixed preset values: they are adaptively computed from the input features of each individual sample, so each sample in a batch is endowed with its own set of $3 \times 3$ kernel weights that adapt to the spatial feature pattern of that sample. This is the core embodiment of the "dynamic" property of our DSA module. Meanwhile, a channel-wise mean operation is conducted on $f_{CA}$ to obtain a single-channel spatial feature map, and the sample-specific $3 \times 3$ dynamic kernels are convolved with this map. The spatial attention weight $w_{DSA}$ is then generated by the sigmoid activation function. Finally, the original feature $f_{CA}$ is multiplied element-wise by $w_{DSA}$ to obtain the enhanced feature $f_{DSA}$.
$$k_{dyn} = \mathrm{Reshape}\left(\mathrm{Conv}_{1\times1}\left(\mathrm{ReLU}\left(\mathrm{Conv}_{1\times1}\left(\mathrm{AvgPool}(f_{CA})\right)\right)\right)\right)$$
$$w_{DSA} = \sigma\left(\mathrm{Conv}_{dyn}\left(\mathrm{mean}(f_{CA})\right)\right)$$
$$f_{DSA} = f_{CA} \odot w_{DSA}$$
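The per-sample dynamic convolution can be implemented with a grouped convolution trick, as in the sketch below; the hidden width of the kernel generator is an assumed value:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicSpatialAttention(nn.Module):
    """Sketch: a per-sample 3x3 kernel is generated from globally pooled
    features, then convolved with the channel-wise mean map to produce
    a spatial attention weight."""

    def __init__(self, channels, hidden=32):
        super().__init__()
        self.gen = nn.Sequential(             # kernel generator
            nn.Conv2d(channels, hidden, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, 9, 1),          # 9 weights -> one 3x3 kernel
        )

    def forward(self, x):                     # x: (B, C, H, W)
        b = x.size(0)
        k = self.gen(F.adaptive_avg_pool2d(x, 1))   # (B, 9, 1, 1)
        k = k.view(b, 1, 3, 3)                      # sample-specific kernels
        m = x.mean(dim=1, keepdim=True)             # (B, 1, H, W)
        # groups=b applies each sample's own kernel to its own map
        w = F.conv2d(m.view(1, b, *m.shape[2:]), k, padding=1, groups=b)
        w = torch.sigmoid(w.view(b, 1, *m.shape[2:]))
        return x * w                          # f_DSA
```

Folding the batch into the channel dimension and using `groups=b` is a standard way to run a different kernel per sample in one call.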
The output of each RAB is added to the input feature via a local residual connection.
$$f_{R\text{-}out}^i = f_{DSA} + f_{R\text{-}in}^i \quad (i = 1, 2, \ldots, 5)$$
$$f_{R\text{-}g} = \mathrm{Conv}_{3\times3}\left(\mathrm{RAB}_5\left(\mathrm{RAB}_4\left(\mathrm{RAB}_3\left(\mathrm{RAB}_2\left(\mathrm{RAB}_1(f_{init})\right)\right)\right)\right)\right)$$
$$f_{res} = f_{R\text{-}g} + f_{init}$$
$$f_c^i = \mathrm{Conv}_{1\times1}(f_{res})$$
The output of the five serially connected RABs is integrated into $f_{R\text{-}g}$ through a $3 \times 3$ convolution, then added to $f_{init}$ via a global residual connection to obtain $f_{res}$. Finally, a $1 \times 1$ convolution adjusts the channel dimension to be consistent with that of $f_t^i$, yielding the fused feature $f_c^i$ of the DRF.
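Pulling these pieces together, the DRF wiring (concatenate, compress, five stacked RABs, global residual, projection) might look as follows; this is a structural sketch only, with the dual attention inside each RAB reduced to an identity placeholder and average pooling assumed as the downsampling operation:

```python
import torch
import torch.nn as nn

class RAB(nn.Module):
    """Skeleton of one Residual Dual Attention Block: two 3x3 convs,
    an attention stage (placeholder here), and a local residual."""
    def __init__(self, c):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(c, c, 3, padding=1), nn.BatchNorm2d(c),
            nn.LeakyReLU(inplace=True),
            nn.Conv2d(c, c, 3, padding=1), nn.BatchNorm2d(c),
        )
        self.attn = nn.Identity()             # stands in for CA + DSA

    def forward(self, x):
        return self.attn(self.body(x)) + x    # local residual connection

class DRF(nn.Module):
    """Concatenate two scales, compress, refine with 5 RABs, add the
    global residual, and project back to the per-stage channel count."""
    def __init__(self, c_prev, c_cur, c_mid=256):
        super().__init__()
        self.down = nn.AvgPool2d(2)                      # align f_t^{i-1}
        self.compress = nn.Sequential(
            nn.Conv2d(c_prev + c_cur, c_mid, 1),
            nn.BatchNorm2d(c_mid), nn.LeakyReLU(inplace=True))
        self.rabs = nn.Sequential(*[RAB(c_mid) for _ in range(5)])
        self.tail = nn.Conv2d(c_mid, c_mid, 3, padding=1)
        self.proj = nn.Conv2d(c_mid, c_cur, 1)

    def forward(self, f_prev, f_cur):
        f_init = self.compress(torch.cat([self.down(f_prev), f_cur], 1))
        f_res = self.tail(self.rabs(f_init)) + f_init    # global residual
        return self.proj(f_res)                          # fused feature f_c^i
```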

3.5. Decoder

To ensure the integrity of the network architecture and achieve accurate mapping from high-level fused features to pixel-level saliency maps, this paper directly adopts the Saliency-Guided Attention Enhanced Decoder (SGAED) from the literature as the decoding module of the network [31]. This decoder adapts to the feature format output by our encoder without additional modification. Its core function is to receive the three groups of high-level fused features ($f_c^i \in \mathbb{R}^{256 \times \frac{112}{2^{i-1}} \times \frac{112}{2^{i-1}}}$, $i = 2, 3, 4$) from the DRF module and the enhanced shallow feature ($f_p^1 \in \mathbb{R}^{128 \times 112 \times 112}$) from the PCE module. Specifically, the $f_t^4$ feature and the saliency map $S_5$ ($14 \times 14$) generated from it are fed into decoder4, after which decoder4 to decoder1 gradually perform feature upsampling and detail refinement (decoder4 does not perform upsampling; $f_{de}^i \in \mathbb{R}^{C \times \frac{112}{2^{i-1}} \times \frac{112}{2^{i-1}}}$ and $S_i$ of size $\frac{112}{2^{i-1}} \times \frac{112}{2^{i-1}}$ for $i = 2, 3, 4$; $f_{de}^2 \in \mathbb{R}^{128 \times 112 \times 112}$). Decoder0 is designed to refine the $S_1$ ($112 \times 112$) features at the $112 \times 112$ resolution and generate the high-detail $S_0$ ($224 \times 224$), which supplies fine-grained supervision. Through this repeated computation, key features are strengthened and representation capability is improved, while the integrity of the five SGAED components in the original framework is preserved. Finally, all saliency maps are interpolated to $224 \times 224$ and used as the supervised outputs and final predictions.

3.6. Loss Function

The loss function constructed in this paper combines CE loss and IoU loss [35,36]. In addition, we design a novel weighting mechanism based on the phase congruency map output by the Phase Congruency Enhanced module.
1) Saliency Supervision: We adopt CE loss and IoU loss to perform basic supervision on the last four layers of intermediate saliency maps, formulated as follows:
$$L_B = \sum_{i=2}^{5} \left[ L_{CE}\left(\mathrm{Up}_{k_i}(S_i), G\right) + L_{IoU}\left(\mathrm{Up}_{k_i}(S_i), G\right) \right]$$
where $S_i$ represents the $i$-th saliency prediction map ($i = 2, 3, 4, 5$), $G$ is the ground-truth label map, and $\mathrm{Up}_{k_i}(\cdot)$ denotes the upsampling operation with an upsampling factor of $k_i$. The value of $k_i$ is defined as:
$$k_i = \begin{cases} 2^4, & \text{if } i = 5 \\ 2^i, & \text{if } i \le 4 \end{cases}$$
2) Phase Congruency Map Weighting (pc_weight): Existing loss functions fail to focus on salient structural regions in ORSI, so we design a phase congruency map weighting mechanism, referred to as pc_weight. It strengthens supervision on key regions and mitigates the impact of the weak features of low-contrast targets. Note that we apply this weighted loss only to $S_0$ and $S_1$, because the phase congruency enhancement module is embedded only in stage 1 of feature extraction and its effect is limited to the shallow feature stage.
The pc_map refers to the single-channel phase congruency map derived by averaging the 128-channel phase congruency weight map generated by Eq. (6) in Section 3.3. It reflects the distribution characteristics of salient structures such as edges and textures in the image. First, the original pc_map is interpolated to the size of the current prediction map and normalized to obtain the pixel-level weight $\omega$ (pc_weight), calculated as:
$$\omega = \frac{\mathrm{Up}_s(\mathrm{pc\_map}) - \min\left(\mathrm{Up}_s(\mathrm{pc\_map})\right)}{\max\left(\mathrm{Up}_s(\mathrm{pc\_map})\right) - \min\left(\mathrm{Up}_s(\mathrm{pc\_map})\right) + \beta}$$
where $\mathrm{Up}_s(\cdot)$ is the upsampling operation that interpolates pc_map to the size of the $i$-th prediction map, and $\beta = 10^{-8}$ avoids a zero denominator. For the first two layers of high-resolution intermediate saliency maps ($S_i$, $i = 0, 1$), this weight is introduced into the loss, formulated as:
$$L_{weighted}^{(i)} = L_{CE}\left(\mathrm{Up}_{k_i}(S_i) \odot \omega,\ G \odot \omega\right) + L_{IoU}\left(\mathrm{Up}_{k_i}(S_i) \odot \omega,\ G \odot \omega\right)$$
where $\omega$ is the normalized pixel-level phase congruency weight map (ranging in $[0, 1]$), $\mathrm{Up}_{k_i}(\cdot)$ denotes the upsampling operation with an upsampling factor of $k_i$, and $\odot$ represents element-wise multiplication.
3) Overall Loss Function: Finally, the overall loss function $L_{total}$ of the model is formulated as follows:
$$L_{total} = L_B + \sum_{i \in \{0, 1\}} L_{weighted}^{(i)}$$
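The pc_weight supervision can be sketched as follows; the soft-IoU formulation and the assumption that predictions are already upsampled to the ground-truth size are ours, not the paper's exact implementation:

```python
import torch
import torch.nn.functional as F

def iou_loss(pred, gt, eps=1e-8):
    """Soft IoU loss on sigmoid probabilities (a common formulation;
    the paper's exact variant may differ)."""
    p = torch.sigmoid(pred)
    inter = (p * gt).sum(dim=(2, 3))
    union = (p + gt - p * gt).sum(dim=(2, 3))
    return (1 - inter / (union + eps)).mean()

def pc_weighted_loss(logits, gt, pc_map, beta=1e-8):
    """Resize the phase-congruency map to the prediction size, min-max
    normalise it into omega, and multiply it into both prediction and
    ground truth before the CE and IoU terms."""
    w = F.interpolate(pc_map, size=gt.shape[2:], mode="bilinear",
                      align_corners=False)
    w = (w - w.min()) / (w.max() - w.min() + beta)   # omega in [0, 1]
    ce = F.binary_cross_entropy_with_logits(logits * w, gt * w)
    return ce + iou_loss(logits * w, gt * w)
```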

4. Experiments

4.1. Experimental Settings

1) Dataset We adopt three mainstream benchmark datasets for ORSI-SOD, including ORSSD [37], EORSSD [38] and ORSI4199 [39]. These datasets cover scenes of varying complexity to comprehensively test the model’s detection performance.
ORSSD is the first public dataset for remote sensing salient object detection, containing 800 pixel-level annotated images (600 for training, 200 for testing). It covers relatively regular scenes and verifies the model's basic performance.
EORSSD is an extension of ORSSD, with 2000 images (1400 training, 600 testing). It adds targets with complex backgrounds and irregular structures, increasing detection difficulty to test the model’s adaptability to complex scenes.
ORSI4199 is a large-scale challenging dataset, with 4199 high-precision annotated images (2000 training, 2199 testing). It includes multi-attribute complex targets (large, small, low-contrast) in realistic scenes, verifying the model’s generalization ability.
2) Implementation Details To verify the proposed network’s performance in ORSI-SOD, experiments are implemented via the PyTorch framework on a workstation with NVIDIA RTX 3090 Ti GPUs, using Python 3.8, PyTorch 1.11.0 and related libraries. During training, images are augmented with random flip, rotation and Gaussian blur, resized to 224 × 224 . The Adam optimizer (initial learning rate 7 × 10 5 , β 1 = 0.9 , β 2 = 0.999 ) is used, with learning rate adjusted via StepLR (step size=20 epochs, decay factor=0.5). Batch size is 8 and the model is trained for 100 epochs.
3) Evaluation Metrics Four mainstream metrics are adopted in the experiments to comprehensively evaluate the model performance.
S-Measure ($S_\alpha$) [40]: It measures the structural similarity between the saliency map and the ground truth, integrating object-level ($S_{oj}$) and region-level ($S_{re}$) similarity:

$$S_\alpha = \alpha \times S_{oj} + (1 - \alpha) \times S_{re}$$

where the balance weight $\alpha$ is set to 0.5 by default.
F-Measure ($F_\beta$) [20]: It is a comprehensive metric that balances precision ($Pr$) and recall ($Re$):

$$F_\beta = \frac{(1 + \beta^2) \times Pr \times Re}{\beta^2 \times Pr + Re}$$

Following the mainstream setting, $\beta^2$ is set to 0.3.
E-measure ($E_\xi$) [41]: It considers both pixel-level correspondence and image-level statistical information simultaneously:

$$E_\xi = \frac{1}{H \times W} \sum_{x=1}^{H} \sum_{y=1}^{W} \phi(x, y)$$

where $H$ and $W$ are the height and width of the saliency map, and $\phi(x, y)$ denotes the enhanced alignment function.
MAE ($M$) [42]: It calculates the average pixel-wise deviation between the saliency map and the ground truth:

$$\mathrm{MAE} = \frac{1}{H \times W} \sum_{x=1}^{H} \sum_{y=1}^{W} \left| S(x, y) - G(x, y) \right|$$

where $S(x, y)$ and $G(x, y)$ are the values of the saliency map and the ground truth at position $(x, y)$, respectively.
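For reference, MAE and a single-threshold F-measure can be computed as in the sketch below; benchmark toolkits typically sweep thresholds or use adaptive ones, so this is only a simplified illustration:

```python
import numpy as np

def mae(sal, gt):
    """Mean absolute error between a saliency map and ground truth,
    both assumed to lie in [0, 1]."""
    return np.abs(sal - gt).mean()

def f_measure(sal, gt, beta2=0.3, thresh=0.5, eps=1e-8):
    """F-measure with beta^2 = 0.3 at a single fixed threshold."""
    pred = sal >= thresh
    tp = np.logical_and(pred, gt > 0.5).sum()
    precision = tp / (pred.sum() + eps)
    recall = tp / ((gt > 0.5).sum() + eps)
    return (1 + beta2) * precision * recall / (beta2 * precision + recall + eps)
```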

4.2. Comparison with SOTA Methods

1) Comparison Methods Our proposed model (ours) and 23 state-of-the-art (SOTA) models are compared across all three benchmark datasets. The compared methods encompass a diverse range of models: RRWR [20] and RCRR [43] are two conventional NSI-SOD models. EGNet [44], MINet [45], and GatedNet [46] are three deep NSI-SOD models.
LVNet [37], DAFNet [38], MCCNet [47], CorrNet [48], ASTTNet [28], MJRBM [49], RRNet [40], EMFINet [50], ERPNet [44], ACCoNet [51], AESINet [27], DCCNet [52], LSHNet [53], and MCPNet [54] are fourteen deep ORSI-SOD models. HFANet [55], ADSTNet [56], HFCNet [31], and CMNFNet [57] are four hybrid ORSI-SOD models. Table 1 lists all the quantitative results, which were generated either by running the corresponding open-source code provided by the authors with default parameter configurations or by computing metrics from publicly accessible saliency maps.
2) Quantitative Comparisons and Discussions As shown in Table 1, our method (ours) performs prominently in key metrics across the ORSSD, EORSSD, and ORSI4199 datasets. Overall, it outperforms most state-of-the-art methods. On the ORSSD dataset, our method ranks first in $S_\alpha$, $F_\beta$, $E_\xi$, and MAE (0.9540, 0.9305, 0.9888, and 0.0071, respectively). Compared with CMNFNet, it achieves a 0.65% improvement in $S_\alpha$ and a 1.16% improvement in $F_\beta$. On the EORSSD dataset, our method ranks first in $F_\beta$ (0.8943), $E_\xi$ (0.9843), and MAE (0.0048), and second in $S_\alpha$ (slightly behind HFCNet-R), while maintaining more balanced overall performance. On the ORSI4199 dataset, our method ranks first in two metrics, with $S_\alpha$ (0.8858) and $F_\beta$ (0.8859). In summary, our method demonstrates competitiveness across all datasets, verifying its effectiveness in ORSI-SOD.
3) Qualitative Comparisons and Discussions We select representative visual examples from the ORSSD, EORSSD, and ORSI4199 datasets. These examples are grouped by scene type (as shown in Figure 5) to qualitatively compare our method with state-of-the-art models.
For the first four rows, representing low-contrast scenes, the target and background exhibit high texture similarity. In the 1st row, this similarity causes blurred edges in the saliency maps of ACCoNet and ERPNet. In rows 2-4, the subtle grayscale difference between target and background leads to missing targets or blurred adhesion in the saliency maps of CMNFNet and MJRBM. In contrast, our method enhances the discriminability of target features in low-contrast regions through the PCE module, producing saliency maps that accurately restore target contours and closely match the ground truth. For the last four rows, representing complex background interference scenes, the cluttered background easily interferes with target detection. In rows 5 and 6, with complex building scenes, models such as CMNFNet and MCCNet tend to introduce background noise spots, whereas our model separates the target from the background more clearly and remains robust to irrelevant background interference. In the 7th row, featuring a dome-shaped scene, GatedNet suffers from edge distortion due to background element interference, while our dual-module design effectively resists redundant background interference and accurately aligns with the true target contour. In the 8th row, the target is extremely small and easily confused with the background, causing methods like CorrNet and MCCNet to miss the target or falsely activate background regions; in contrast, our method detects it correctly.

4.3. Ablation Study

To verify the effectiveness of the proposed modules, core components, and loss functions, we conduct ablation experiments in this section. The ablations are designed at the module, component, and loss function levels. We further analyze the contribution of each part to model performance with visualization results.
1) Ablation study between different modules To verify the effectiveness of the proposed PCE and DRF modules, we designed four model variants on the ORSSD and EORSSD datasets: a) Base (Swin-B Encoder and SGAED Decoder), b) Base+PCE, c) Base+DRF, d) Base+DRF+PCE (ours).
The quantitative results (Table 2) demonstrate that, on the ORSSD dataset, the Base performs the worst. After integrating PCE, $S_\alpha$ is improved by 0.74%. When adding DRF, $S_\alpha$ increases by 0.64%, while $F_\beta$ and $E_\xi$ are slightly improved. Our method, which combines both modules, not only enhances target discriminability in low-contrast regions via PCE but also aligns cross-scale features and filters background interference via DRF. Compared with the Baseline, our method achieves a 0.99% improvement in $S_\alpha$ and a 1.40% improvement in $F_\beta$. Consistent trends are observed on the EORSSD dataset, where our method improves $S_\alpha$ and $F_\beta$ by 0.67% and 2.73%, respectively, compared with the Baseline, confirming the targeted value of the two modules in addressing typical issues of optical remote sensing images.
Figure 6 further demonstrates the role of our modules. In the scene of the 1st row, the Base result exhibits target edge diffusion due to background interference; adding DRF makes the predicted contours more compact and well defined, suggesting that DRF suppresses redundant background responses. In the 2nd row, with a low-contrast linear target, the Base result shows discontinuities, while schemes with PCE preserve the target shape more completely, indicating PCE’s ability to enhance low-contrast features. In the 3rd row, schemes with a single module lose details, whereas combining DRF and PCE preserves the target shape much better, showing their synergy in suppressing interference and enhancing features for more accurate segmentation.
To verify the enhancement effect of PCE on the features extracted by Swin Transformer Stage 1, we generate feature heatmaps with a Tkinter-based feature map viewer, as shown in Figure 7. The differences between the original Stage 1 heatmaps and the enhanced ones are clear, reflecting improved target details and saliency. In Row 1, the original Stage 1 features show small brightness differences between target and background and blurry contours; after enhancement, the target region becomes brighter and more distinguishable. In Row 2, the original features contain only sparse dot-like highlights; after enhancement, the target region becomes more uniform, and its shape is more complete and better separated from the background. In Row 3, the original edge responses are discontinuous; after enhancement, continuous highlight bands appear along the target edges. These observations indicate that PCE improves target–background separability in low-contrast scenes and strengthens edge integrity, providing a stronger detail basis for subsequent cross-scale fusion.
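Heatmaps like those in Figure 7 are typically produced by channel-averaging a feature tensor and min-max normalizing it. The sketch below is our illustration of that common recipe, not the authors' viewer code; the Stage 1 shape is a hypothetical example.

```python
import numpy as np

def feature_heatmap(feat: np.ndarray) -> np.ndarray:
    """Collapse a (C, H, W) feature tensor into a normalized (H, W) heatmap.

    Channel-averaging followed by min-max scaling is a standard way to
    visualize backbone activations; the paper's Tkinter-based viewer is not
    public, so this is only an illustrative sketch of the usual recipe.
    """
    hm = feat.mean(axis=0)                 # average over channels -> (H, W)
    hm = hm - hm.min()                     # shift minimum to 0
    span = hm.max()
    return hm / span if span > 0 else hm   # scale into [0, 1]

# Hypothetical Stage 1 feature shape (channels, height, width).
stage1 = np.random.default_rng(0).standard_normal((96, 56, 56))
hm = feature_heatmap(stage1)
print(hm.shape, float(hm.min()), float(hm.max()))  # (56, 56) 0.0 1.0
```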
2) Ablation study within the modules: To verify the effectiveness of the core components in the PCE and DRF modules, we design ablation variants with key components removed and analyze their performance on the ORSSD and EORSSD datasets.
For the PCE module, we design two variants: w/o PC calculation (the Wave Filter, Phase Congruency Calculation, and Stripe Suppression components are removed, retaining only the VSA) and w/o VSA (PCE without the VSA). As shown in Table 3(a), removing either part leads to performance degradation. As shown in Figure 8(a), in the 1st row, removing these parts results in the loss of texture details inside the target; in the 2nd row, “ours” accurately reproduces the target shape, while “w/o VSA” suffers from detail loss and background interference. This indicates that the phase congruency calculation enhances target edges and textures while suppressing background noise, and the VSA focuses on the target region to strengthen effective feature responses. The two parts work synergistically, serving as the core of the PCE module for low-contrast feature enhancement, and both are indispensable for the performance improvement.
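To make the role of the VSA branch concrete, here is a minimal sketch of a variance-based spatial attention gate. The exact formulation is not reproduced in this excerpt, so the standardization and sigmoid gating below are our assumptions, kept deliberately simple.

```python
import numpy as np

def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-x))

def variance_spatial_attention(feat: np.ndarray) -> np.ndarray:
    """Minimal sketch of a variance-based spatial attention (VSA) gate.

    Assumption (the paper's exact formulation is not shown here): the
    per-pixel variance across channels is standardized and squashed into a
    (0, 1) gate that reweights the features, so that high-variance,
    detail-rich locations are emphasized.
    """
    var = feat.var(axis=0)                          # (H, W) channel variance
    var = (var - var.mean()) / (var.std() + 1e-6)   # standardize
    gate = sigmoid(var)                             # gate in (0, 1)
    return feat * gate[None, :, :]                  # broadcast over channels

x = np.random.default_rng(1).standard_normal((8, 16, 16))
y = variance_spatial_attention(x)
print(y.shape)  # (8, 16, 16)
```

Since the gate lies strictly in (0, 1), the output never amplifies a location, only attenuates low-variance ones relative to high-variance ones.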
For the DRF module, we design two variants: w/o CA (DRF without channel attention) and w/o DSA (DRF without dynamic spatial attention). The quantitative results (Table 3(b)) indicate that performance decreases after removing either CA or DSA. For the dome-shaped building targets in Figure 8(b), the saliency maps of w/o CA contain background clutter, and the target contours of w/o DSA are distorted; a similar pattern appears in the river scene. The complete DRF module generates saliency maps with clear and intact shapes, verifying that the synergy between CA and DSA effectively resolves the background-interference challenge.
We select 5 stacked RABs for the DRF module, the optimal choice according to the ablation experiments. Table 4 shows that increasing the number of RABs from 3 to 5 improves all three core metrics ( S α , F β , E ξ ) on the ORSSD dataset, suggesting that more RABs strengthen feature fusion and redundancy filtering. When the number of RABs exceeds 5, parameter redundancy causes gradient attenuation, and all metrics decline. Therefore, a stack of 5 RABs enables the model to achieve the best performance.
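The serial RAB stack can be sketched as follows. The internals of each block (channel gate, spatial gate, residual connection) are our simplified assumptions standing in for the learned layers; only the data flow is illustrative.

```python
import numpy as np

def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-x))

def rab(feat: np.ndarray) -> np.ndarray:
    """One Residual Attention Block (RAB), heavily simplified.

    Assumed data flow (the learned layers are omitted): a channel gate from
    global average pooling, a spatial gate from the channel mean, and a
    residual connection adding the reweighted features back to the input.
    """
    ca = sigmoid(feat.mean(axis=(1, 2)))   # (C,) channel attention gate
    out = feat * ca[:, None, None]
    sa = sigmoid(out.mean(axis=0))         # (H, W) spatial attention gate
    out = out * sa[None, :, :]
    return feat + out                      # residual connection

def drf_stack(feat: np.ndarray, n_blocks: int = 5) -> np.ndarray:
    """Serially stack n_blocks RABs, as in the DRF module (5 is optimal)."""
    for _ in range(n_blocks):
        feat = rab(feat)
    return feat

x = np.random.default_rng(2).standard_normal((8, 16, 16))
y = drf_stack(x, n_blocks=5)
print(y.shape)  # (8, 16, 16)
```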
3) The impact of pc_weight on the loss function: This section verifies the effectiveness of pc_weight. In the original design, saliency maps S 0 and S 1 use a pc_weight-weighted combination of CE loss and IoU loss, while S 2 – S 5 adopt ordinary CE and IoU losses. Removing pc_weight in this ablation means that S 0 and S 1 also use only the ordinary CE and IoU losses (consistent with S 2 – S 5 ). The results are shown in Table 5 and Figure 8(c). Table 5 shows that after removing pc_weight, with only the basic CE and IoU losses remaining, S α on ORSSD decreases from 0.9540 to 0.9484 and F β drops from 0.9305 to 0.9244, indicating that pc_weight effectively improves model performance. The visualization of this ablation is presented in Figure 8(c). For the low-contrast building targets, the prediction map of w/o pc_weight loses the prominent structures of the buildings, demonstrating that pc_weight strengthens the detail supervision of low-contrast targets. For the bridge targets, the prediction map of w/o pc_weight exhibits false activations, showing that pc_weight constrains the structural integrity of the targets.
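A pc_weight-style supervision can be sketched as a per-pixel weighted cross-entropy plus a soft IoU term. The exact weighting formula is not reproduced in this excerpt; the `1 + pc_map` weighting below is our assumption for illustration.

```python
import numpy as np

def pc_weighted_loss(pred: np.ndarray, gt: np.ndarray,
                     pc_map: np.ndarray, eps: float = 1e-6) -> float:
    """Sketch of a pc_weight-style loss: pixel-weighted CE plus soft IoU.

    Assumption (the paper's exact formula is not shown here): a
    phase-congruency map `pc_map` with values in [0, 1] upweights the
    cross-entropy of edge/texture pixels, while a soft IoU term constrains
    the overall structure of the prediction.
    """
    pred = np.clip(pred, eps, 1.0 - eps)
    w = 1.0 + pc_map                                   # weights in [1, 2]
    ce = -(gt * np.log(pred) + (1.0 - gt) * np.log(1.0 - pred))
    weighted_ce = (w * ce).sum() / w.sum()
    inter = (pred * gt).sum()
    union = (pred + gt - pred * gt).sum()
    iou_loss = 1.0 - (inter + eps) / (union + eps)     # soft IoU loss
    return float(weighted_ce + iou_loss)

gt = np.zeros((8, 8)); gt[2:6, 2:6] = 1.0              # square target
good = np.where(gt == 1.0, 0.9, 0.1)                   # confident prediction
flat = np.full((8, 8), 0.5)                            # uninformative one
pc = np.zeros((8, 8))                                  # no edge emphasis here
print(pc_weighted_loss(good, gt, pc) < pc_weighted_loss(flat, gt, pc))  # True
```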
Table 5. The impact of our pc_weight on the loss function.
No. Baseline pc_weight ORSSD ( S α / F β / E ξ ) EORSSD ( S α / F β / E ξ )
1 ✓ – 0.9484 0.9244 0.9844 0.9323 0.8854 0.9641
2 ✓ ✓ 0.9540 0.9305 0.9888 0.9393 0.8943 0.9843

4.4. Computational Complexity Analysis

1) Complexity Changes from Module Superposition: The model uses TF and SGAED as the Baseline. Adding the PCE module increases FLOPs by 6.67 G and Params by 0.53 M; PCE introduces only lightweight frequency-domain filtering operations, so its complexity overhead is minimal. Adding the DRF module increases FLOPs by 21 G and Params by 13 M; the stacked RAB attention blocks in DRF are the main source of complexity, but they ensure the effectiveness of cross-scale fusion.
2) Complexity Comparison with Existing Methods: Our proposed model has 126.94 G FLOPs and 117.29 M parameters. Compared with high-performance methods such as EMFINet (176.87 G) and ACCoNet (184.50 G), it achieves better performance while reducing FLOPs by around 30%. Compared with HFCNet, it achieves better performance with fewer parameters and comparable FLOPs. Relative to models with lower complexity (e.g., AESINet), our model is slightly more complex but performs significantly better (see Section 4.2 for details).
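The "around 30%" reduction can be checked directly from the Table 6(b) values:

```python
# Check the claimed ~30% FLOPs reduction from the Table 6(b) values (GFLOPs).
ours, emfinet, acconet = 126.94, 176.87, 184.50

def reduction_pct(ours_flops: float, other_flops: float) -> float:
    """Relative FLOPs reduction of our model versus another, in percent."""
    return round((other_flops - ours_flops) / other_flops * 100, 1)

print(reduction_pct(ours, emfinet))  # 28.2
print(reduction_pct(ours, acconet))  # 31.2
```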
Table 6. Model Computational Complexity Comparison: (a) complexity analysis of the proposed modules, (b) comparison of complexity with some state-of-the-art methods.
(a)
Models FLOPs Params
TF 71.44 G 86.64 M
TF+SGAED 99.27 G 103.76 M
TF+SGAED+PCE 105.94 G (↑6.67) 104.29 M (↑0.53)
TF+SGAED+PCE+DRF(Ours) 126.94 G (↑21) 117.29 M (↑13)
(b)
Models FLOPs Params
MCCNet 117.15 G 67.65 M
EMFINet 176.87 G 95.09 M
ERPNet 131.63 G 77.19 M
ACCoNet 184.50 G 102.55 M
AESINet 53.42 G 41.05 M
ASTTNet 43.12 G 23.35 M
ADSTNet 62.09 G 27.72 M
HFCNet 120.41 G 140.75 M
ours 126.94 G 117.29 M

5. Conclusion

In this paper, we propose PCFNet, a novel network for tackling challenging target detection in ORSI-SOD tasks. We embed the PCE and DRF modules into a Swin Transformer backbone to build a unified pipeline that combines frequency-domain detail enhancement, cross-scale semantic fusion, and refined supervision. Specifically, PCE extracts frequency-domain phase features, breaking the limitation of traditional spatial enhancement that relies on brightness differences, and improves the detail representation of targets in low-contrast scenes. The residual attention block (RAB) designed in DRF fuses channel and dynamic spatial attention to resolve complex background interference. Extensive experiments on three datasets validate the superior performance of PCFNet.

Data Availability Statement

The data will be made publicly available upon acceptance of the paper.

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work presented in this manuscript.

References

  1. Borji, A.; Cheng, M.; Jiang, H.; Li, J. Salient Object Detection: A Benchmark. IEEE Transactions on Image Processing 2015, vol. 24, no. 12, 5706–5722.
  2. Wang, W.; Lai, Q.; Fu, H.; Shen, J.; Ling, H.; Yang, R. Salient Object Detection in the Deep Learning Era: An In-Depth Survey. IEEE Transactions on Pattern Analysis and Machine Intelligence 2022, vol. 44, no. 6, 3239–3259.
  3. Zhang, P.; Zhuo, T.; Huang, W.; Chen, K.; Kankanhalli, M. Online object tracking based on CNN with spatial–temporal saliency guided sampling. Neurocomputing 2017, vol. 257, 115–127.
  4. Gao, L.; Liu, B.; Fu, P.; Xu, M.; Li, J. Visual Tracking via Dynamic Saliency Discriminative Correlation Filter. Applied Intelligence 2022, vol. 52, no. 6, 5897–5911.
  5. Song, X.; Lin, H.; Wen, H.; Hou, B.; Xu, M.; Nie, L. A Comprehensive Survey on Composed Image Retrieval. ACM Transactions on Information Systems 2025, vol. 44, no. 1, art. no. 19, 1–54.
  6. Li, C.; Guo, C.; Ren, W.; Cong, R.; Hou, J.; Kwong, S. An Underwater Image Enhancement Benchmark Dataset and Beyond. IEEE Transactions on Image Processing 2020, vol. 29, 4376–4389.
  7. Yang, L.; Wu, J.; Li, H.; Liu, C.; Wei, S. Real-Time Runway Detection Using Dual-Modal Fusion of Visible and Infrared Data. Remote Sensing 2025, vol. 17, no. 4, 669.
  8. Lei, J.; Wang, H.; Lei, Z.; Li, J.; Rong, S. CNN–Transformer Hybrid Architecture for Underwater Sonar Image Segmentation. Remote Sensing 2025, vol. 17, no. 4, 707.
  9. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016; pp. 770–778.
  10. Zhou, B.; Khosla, A.; Lapedriza, A.; Oliva, A.; Torralba, A. Learning Deep Features for Discriminative Localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016; pp. 2921–2929.
  11. Chen, J.; Zhang, H.; Gong, M.; Gao, Z. Collaborative Compensative Transformer Network for Salient Object Detection. Pattern Recognition 2024, vol. 154, art. no. 110600.
  12. Azad, R.; Kazerouni, A.; Azad, B.; Khodapanah Aghdam, E.; Velichko, Y.; Bagci, U.; Merhof, D. Laplacian-Former: Overcoming the Limitations of Vision Transformers in Local Texture Detection. In Medical Image Computing and Computer Assisted Intervention (MICCAI), LNCS 2023, vol. 14222, 736–746.
  13. Wang, X.; Wan, L.; Lin, D.; Feng, W. Phase-based fine-grained change detection. Expert Systems with Applications 2023, vol. 227, art. no. 120181.
  14. Perazzi, F.; Krähenbühl, P.; Pritch, Y.; Hornung, A. Saliency filters: Contrast based filtering for salient region detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012; pp. 733–740.
  15. Zhang, Q.; Wang, S.; Wang, X.; Sun, Z.; Kwong, S.; Jiang, J. Geometry Auxiliary Salient Object Detection for Light Fields via Graph Neural Networks. IEEE Transactions on Image Processing 2021, vol. 30, 7578–7592.
  16. Xu, M.; Sun, Z.; Hu, Y.; Tang, H.; Hu, Y.; Song, X.; Nie, L. Superpixel Segmentation With Edge Guided Local-Global Attention Network. IEEE Transactions on Circuits and Systems for Video Technology 2025, vol. 35, no. 12, 11922–11934.
  17. Yuan, X.; Zhang, B.; Zhou, J.; Lian, C.; Zhang, Q.; Yue, J. Gradient residual attention network for infrared image super-resolution. Optics and Lasers in Engineering 2024, vol. 175, art. no. 107998.
  18. Wang, F.; Jiang, M.; Qian, C.; Yang, S. Residual Attention Network for Image Classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017; pp. 6450–6458.
  19. Bi, J.; Wei, H.; Zhang, G.; Yang, K.; Song, Z. DyFusion: Cross-Attention 3D Object Detection with Dynamic Fusion. IEEE Latin America Transactions 2024, vol. 22, no. 2, 106–112.
  20. Achanta, R.; Hemami, S.; Estrada, F.; Susstrunk, S. Frequency-tuned salient region detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009; pp. 1597–1604.
  21. Luo, W.; Li, Y.; Urtasun, R.; Zemel, R. S. Understanding the Effective Receptive Field in Deep Convolutional Neural Networks. In Advances in Neural Information Processing Systems (NIPS), 2016.
  22. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-End Object Detection with Transformers. In Computer Vision – ECCV 2020; Springer: Cham, 2020; vol. 12346.
  23. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2021; pp. 9992–10002.
  24. Xie, C.; Xia, C.; Ma, M.; Zhao, Z.; Chen, X.; Li, J. Pyramid Grafting Network for One-Stage High Resolution Saliency Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022; pp. 11707–11716.
  25. Li, K.; Wan, G.; Cheng, G.; Meng, L.; Han, J. Object detection in optical remote sensing images: A survey and a new benchmark. ISPRS Journal of Photogrammetry and Remote Sensing 2020, vol. 159, 296–307.
  26. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Medical Image Computing and Computer-Assisted Intervention (MICCAI), 2015; LNCS vol. 9351, pp. 234–241.
  27. Zeng, X.; Xu, M.; Hu, Y.; Tang, H.; Hu, Y.; Nie, L. Adaptive Edge-Aware Semantic Interaction Network for Salient Object Detection in Optical Remote Sensing Images. IEEE Transactions on Geoscience and Remote Sensing 2023, vol. 61, 1–16.
  28. Gao, L.; Liu, B.; Fu, P.; Xu, M. Adaptive Spatial Tokenization Transformer for Salient Object Detection in Optical Remote Sensing Images. IEEE Transactions on Geoscience and Remote Sensing 2023, vol. 61, 1–15.
  29. Cheng, B.; Liu, Z.; Tang, H.; Wang, Q. Multimodal-Guided Transformer Architecture for Remote Sensing Salient Object Detection. IEEE Transactions on Geoscience and Remote Sensing 2025, vol. 22, 1–5.
  30. Li, J.; Li, C.; Zheng, X.; Liu, X.; Tang, C. Global Context Relation-Guided Feature Aggregation Network for Salient Object Detection in Optical Remote Sensing Images. Remote Sensing 2024, vol. 16, no. 16, 2978.
  31. Liu, Y.; Xu, M.; Xiao, T.; Tang, H.; Hu, Y.; Nie, L. Heterogeneous Feature Collaboration Network for Salient Object Detection in Optical Remote Sensing Images. IEEE Transactions on Geoscience and Remote Sensing 2024, vol. 62, 1–14.
  32. Gao, F.; Fu, M.; Cao, J.; Dong, J.; Du, Q. Adaptive Frequency Enhancement Network for Remote Sensing Image Semantic Segmentation. IEEE Transactions on Geoscience and Remote Sensing 2025, vol. 63, 1–15.
  33. Xu, M.; Yu, C.; Li, Z.; Tang, H.; Hu, Y.; Nie, L. HDNet: A Hybrid Domain Network With Multiscale High-Frequency Information Enhancement for Infrared Small-Target Detection. IEEE Transactions on Geoscience and Remote Sensing 2025, vol. 63, 1–15.
  34. Xiao, P.; Feng, X.; Zhao, S.; She, J. Segmentation of High-resolution Remotely Sensed Imagery Based on Phase Congruency. Acta Geodaetica et Cartographica Sinica 2007, vol. 36, no. 2, 146–151.
  35. Zhang, Z.; Sabuncu, M. R. Generalized cross entropy loss for training deep neural networks with noisy labels. In Proceedings of the 32nd International Conference on Neural Information Processing Systems (NeurIPS), 2018; pp. 8792–8802.
  36. Yu, J.; Jiang, Y.; Wang, Z.; Cao, Z.; Huang, T. UnitBox: An advanced object detection network. In Proceedings of the ACM International Conference on Multimedia, 2016; pp. 516–520.
  37. Li, C.; Cong, R.; Hou, J.; Zhang, S.; Qian, Y.; Kwong, S. Nested Network With Two-Stream Pyramid for Salient Object Detection in Optical Remote Sensing Images. IEEE Transactions on Geoscience and Remote Sensing 2019, vol. 57, no. 11, 9156–9166.
  38. Zhang, Q.; Cong, R.; Li, C.; Cheng, M.; Fang, Y.; Cao, X.; Zhao, Y.; Kwong, S. Dense Attention Fluid Network for Salient Object Detection in Optical Remote Sensing Images. IEEE Transactions on Image Processing 2021, vol. 30, 1305–1317.
  39. Tu, Z.; Wang, C.; Li, C.; Fan, M.; Zhao, H.; Luo, B. ORSI Salient Object Detection via Multiscale Joint Region and Boundary Model. IEEE Transactions on Geoscience and Remote Sensing 2022, vol. 60, 1–13.
  40. Fan, D.-P.; Cheng, M.-M.; Liu, Y.; Li, T.; Borji, A. Structure-Measure: A New Way to Evaluate Foreground Maps. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017; pp. 4558–4567.
  41. Fan, D.-P.; Gong, C.; Cao, Y.; Ren, B.; Cheng, M.-M.; Borji, A. Enhanced-alignment measure for binary foreground map evaluation. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), 2018; pp. 698–704.
  42. Yu, J.-G.; Zhao, J.; Tian, J.; Tan, Y. Maximal entropy random walk for region-based visual saliency. IEEE Transactions on Cybernetics 2014, vol. 44, no. 9, 1661–1672.
  43. Yuan, Y.; Li, C.; Kim, J.; Cai, W.; Feng, D. D. Reversion correction and regularized random walk ranking for saliency detection. IEEE Transactions on Image Processing 2018, vol. 27, no. 3, 1311–1322.
  44. Zhou, X. Edge-guided recurrent positioning network for salient object detection in optical remote sensing images. IEEE Transactions on Cybernetics 2023, vol. 53, no. 1, 539–552.
  45. Pang, Y.; Zhao, X.; Zhang, L.; Lu, H. Multi-scale interactive network for salient object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020; pp. 9413–9422.
  46. Zhao, X.; Pang, Y.; Zhang, L.; Lu, H.; Zhang, L. Suppress and balance: A simple gated network for salient object detection. In Proceedings of the European Conference on Computer Vision (ECCV), 2020; pp. 35–51.
  47. Li, G.; Liu, Z.; Lin, W.; Ling, H. Multi-content complementation network for salient object detection in optical remote sensing images. IEEE Transactions on Geoscience and Remote Sensing 2022, vol. 60, art. no. 5614513.
  48. Li, G.; Liu, Z.; Bai, Z.; Lin, W.; Ling, H. Lightweight salient object detection in optical remote sensing images via feature correlation. IEEE Transactions on Geoscience and Remote Sensing 2022, vol. 60, 1–12.
  49. Tu, Z.; Wang, C.; Li, C.; Fan, M.; Zhao, H.; Luo, B. ORSI salient object detection via multiscale joint region and boundary model. IEEE Transactions on Geoscience and Remote Sensing 2022, vol. 60, art. no. 5607913.
  50. Zhou, X.; Shen, K.; Liu, Z.; Gong, C.; Zhang, J.; Yan, C. Edge-aware multiscale feature integration network for salient object detection in optical remote sensing images. IEEE Transactions on Geoscience and Remote Sensing 2022, vol. 60, 1–15.
  51. Li, G.; Liu, Z.; Zeng, D.; Lin, W.; Ling, H. Adjacent context coordination network for salient object detection in optical remote sensing images. IEEE Transactions on Cybernetics 2023, vol. 53, no. 1, 526–538.
  52. Huang, J.; Huang, K. Dynamic Context Coordination for Salient Object Detection in Optical Remote Sensing Images. IEEE Transactions on Geoscience and Remote Sensing 2025, vol. 22, 1–5.
  53. Lee, S.; Cho, S.; Park, C.; Park, S.; Kim, J.; Lee, S. LSHNet: Leveraging Structure-Prior With Hierarchical Features Updates for Salient Object Detection in Optical Remote Sensing Images. IEEE Transactions on Geoscience and Remote Sensing 2024, vol. 62, 1–16.
  54. Huang, K.; Li, N.; Huang, J.; Tian, C. Exploiting Memory-Based Cross-Image Contexts for Salient Object Detection in Optical Remote Sensing Images. IEEE Transactions on Geoscience and Remote Sensing 2024, vol. 62, 1–15.
  55. Wang, Q.; Liu, Y.; Xiong, Z.; Yuan, Y. Hybrid feature aligned network for salient object detection in optical remote sensing imagery. IEEE Transactions on Geoscience and Remote Sensing 2022, vol. 60, art. no. 5624915.
  56. Zhao, J.; Jia, Y.; Ma, L.; Yu, L. Adaptive dual-stream sparse transformer network for salient object detection in optical remote sensing images. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 2024, vol. 17, 5173–5192.
  57. Xu, M.; Wang, S.; Hu, Y.; Tang, H.; Cong, R.; Nie, L. Cross-Model Nested Fusion Network for Salient Object Detection in Optical Remote Sensing Images. IEEE Transactions on Cybernetics 2025, vol. 55, no. 11, 5332–5345.
Figure 1. Overall architecture of the proposed network. The network consists of four components: the Swin-B feature extractor, the PCE module, the DRF module, and the SGAED decoder.
Figure 2. Internal structure of PCE, consisting of the Wave Filter Module (A), Phase Congruency Calculation (B), Stripe Suppression Module (C), and Variance-Based Spatial Attention (D).
Figure 3. (a) and (c) are without the Stripe Suppression Module; (b) and (d) are with it. (a)(b) form the first example pair and (c)(d) the second.
Figure 4. Internal structure of DRF. Five RABs are connected in series.
Figure 5. Qualitative comparisons of our method with seven representative SOTA methods in eight challenging scenarios. Rows 1, 2, 3, 4: low-contrast scenes, Rows 5, 6, 7, 8: complex background interference scenes.
Figure 6. Ablation study between different modules.
Figure 7. Heatmap of the enhancement effect of the PCE module on the Stage 1 features of the Swin Transformer.
Figure 8. Visualization of ablation study results within modules: (a) ablation experiments inside the PCE module, (b) ablation experiments inside the DRF module, and (c) ablation experiments on the loss function.
Table 1. Quantitative comparison of our method with 23 methods proposed by other researchers on the ORSSD, EORSSD and ORSI4199 datasets. The symbol “↑” indicates that a higher value is better for the metric, while “↓” indicates that a lower value is better. The top three results are highlighted in red, blue, and green, respectively. Results of some models may be unavailable for partial datasets, which are indicated as “-”.
Method Publication Type ORSSD ( S α ↑ / F β ↑ / E ξ ↑ / MAE↓) EORSSD ( S α ↑ / F β ↑ / E ξ ↑ / MAE↓) ORSI4199 ( S α ↑ / F β ↑ / E ξ ↑ / MAE↓)
RRWR 2015 CVPR T-NSI 0.6835 0.5590 0.7649 0.1324 0.5992 0.3993 0.6894 0.1677 0.6416 0.5407 0.7116 0.1717
RCRR 2018 TIP T-NSI 0.6849 0.5591 0.7651 0.1277 0.6007 0.3995 0.6882 0.1644 0.6491 0.548 0.7192 0.1637
ASTTNet 2023 TGRS T-ORSI 0.9347 0.9060 0.9794 0.0094 0.9253 0.8741 0.9580 0.006 0.8827 0.8788 0.9512 0.0273
EGNet 2019 ICCV C-NSI 0.8721 0.8332 0.9731 0.0216 0.8601 0.7880 0.9570 0.0110 0.8516 0.8371 0.9241 0.0385
MINet 2020 CVPR C-NSI 0.9040 0.8761 0.9545 0.0144 0.9040 0.8344 0.9442 0.0093 0.8116 0.7988 0.8961 0.0504
GatedNet 2020 ECCV C-NSI 0.9186 0.8871 0.9664 0.0137 0.9114 0.8566 0.9610 0.0095 0.8545 0.8450 0.9256 0.0393
LVNet-V 2019 TGRS C-ORSI 0.8815 0.8263 0.9456 0.0207 0.8630 0.7794 0.9254 0.0146 - - - -
DAFNet-V 2021 TIP C-ORSI 0.9191 0.8928 0.9771 0.0113 0.9166 0.8614 0.9861 0.0060 0.8492 0.8348 0.9181 0.0422
MCCNet-V 2021 TGRS C-ORSI 0.9437 0.9155 0.9800 0.0087 0.9327 0.8904 0.9755 0.0066 - - - -
CorrNet-V 2022 TGRS C-ORSI 0.9380 0.9129 0.9790 0.0098 0.9289 0.8778 0.9696 0.0083 0.8626 0.8560 0.9333 0.0366
MJRBM-R 2022 TGRS C-ORSI 0.9211 0.8885 0.9686 0.0145 0.9091 0.8555 0.9655 0.0099 0.8582 0.8511 0.9343 0.0372
RRNet-R 2022 TGRS C-ORSI 0.9339 0.9011 0.9722 0.0113 0.9266 0.8743 0.9665 0.0082 0.8585 0.8500 0.9286 0.0367
EMFINet-R 2022 TGRS C-ORSI 0.9432 0.9155 0.9813 0.0095 0.9319 0.8742 0.9712 0.0075 0.8712 0.8636 0.9403 0.0313
ERPNet-R 2023 TCYB C-ORSI 0.9352 0.9036 0.9738 0.0114 0.9252 0.8743 0.9665 0.0082 0.8636 0.8528 0.9292 0.0388
ACCoNet-R 2023 TCYB C-ORSI 0.9428 0.9149 0.9819 0.0087 0.9302 0.8821 0.9759 0.0067 0.8805 0.8688 0.9424 0.032
AESINet-R 2023 TGRS C-ORSI 0.9455 0.9160 0.9814 0.0085 0.9347 0.8792 0.9757 0.0064 0.8755 0.8726 0.9459 0.0305
DCCNet 2024 LGRS C-ORSI 0.9417 0.9168 0.9805 0.0092 0.9345 0.8887 0.9761 0.0067 0.8705 0.8619 0.9348 0.0347
LSHNet 2024 TGRS C-ORSI 0.9491 0.9200 0.9824 0.0075 0.9370 0.8643 0.9761 0.0064 0.8759 0.8758 0.9462 0.0299
MCPNet 2024 TGRS C-ORSI 0.9433 0.9135 0.9807 0.0090 0.9373 0.8868 0.9765 0.0070 0.8736 0.8667 0.9402 0.0324
HFANet-R 2022 TGRS H-ORSI 0.9399 0.9117 0.9770 0.0092 0.9380 0.8876 0.9740 0.0071 0.8767 0.8700 0.9431 0.0314
ADSTNet-R 2024 JSTARS H-ORSI 0.9379 0.9124 0.9807 0.0086 0.9311 0.8804 0.9769 0.0065 0.8710 0.8698 0.9433 0.0318
HFCNet-R 2024 TGRS H-ORSI 0.9521 0.9247 0.9885 0.0073 0.9407 0.8864 0.9793 0.0054 0.8838 0.8833 0.9539 0.0277
CMNFNet 2025 TCYB H-ORSI 0.9475 0.9189 0.9832 0.0078 0.9377 0.8851 0.9774 0.0063 0.8774 0.8752 0.9885 0.0301
ours - T-ORSI 0.9540 0.9305 0.9888 0.0071 0.9393 0.8943 0.9843 0.0048 0.8858 0.8859 0.9531 0.0279
Table 2. Ablation study between different modules on ORSSD and EORSSD.
No. Base PCE DRF ORSSD ( S α / F β / E ξ ) EORSSD ( S α / F β / E ξ )
1 ✓ – – 0.9441 0.9165 0.9666 0.9326 0.8670 0.9589
2 ✓ ✓ – 0.9511 0.9215 0.9686 0.9361 0.8706 0.9582
3 ✓ – ✓ 0.9505 0.9267 0.9868 0.9354 0.8901 0.9806
4 ✓ ✓ ✓ 0.9540 0.9305 0.9888 0.9393 0.8943 0.9843
Table 3. Ablation study within the modules on ORSSD and EORSSD datasets.
Model variants ORSSD ( S α / F β / E ξ ) EORSSD ( S α / F β / E ξ )
(a) Ablation study in PCE
ours 0.9540 0.9305 0.9888 0.9393 0.8943 0.9843
w/o PC calculation 0.9491 0.9239 0.9834 0.9328 0.8875 0.9788
w/o VSA 0.9491 0.9246 0.9840 0.9381 0.8917 0.9819
(b) Ablation study in DRF
ours 0.9540 0.9305 0.9888 0.9393 0.8943 0.9843
w/o CA 0.9489 0.9294 0.9806 0.9330 0.8866 0.9797
w/o DSA 0.9509 0.9262 0.9874 0.9364 0.8892 0.9783
Table 4. Ablation study on the number of RABs (ORSSD dataset).
Model variants S α F β E ξ
3 RAB 0.9457 0.9196 0.9824
4 RAB 0.9488 0.9219 0.9828
5 RAB 0.9540 0.9305 0.9888
6 RAB 0.9509 0.9276 0.9869
7 RAB 0.9531 0.9274 0.9863
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.