Preprint
Article

This version is not peer-reviewed.

Phase Congruency-Guided Cross-Scale Contextual Fusion Network for Salient Object Detection in Optical Remote Sensing Images

Submitted: 26 March 2026
Posted: 27 March 2026


Abstract
In recent years, salient object detection in optical remote sensing images (ORSI-SOD) has garnered increasing research attention. However, in practical applications, issues such as blurred target edges under low contrast and complex background interference continue to restrict the accuracy and robustness of detection. To address these problems, this paper proposes the Phase Congruency-Guided Cross-Scale Contextual Fusion Network (PCFNet). Specifically, we design a novel Phase Congruency Enhanced (PCE) module to address the low contrast between targets and backgrounds. It acquires multi-scale phase features via Fourier decomposition, fuses them with shallow Transformer features, and applies a tailored loss weighting mechanism that emphasizes phase-congruent regions during supervision, improving the adaptation of the PCE module. To counter complex background interference, we design a novel Dynamic Residual Fusion (DRF) module. It leverages dynamic spatial attention and residual connections to refine multi-scale features, enabling the model to accurately capture effective target features under complex background interference. Experiments on the ORSSD, EORSSD, and ORSI4199 benchmarks show that PCFNet outperforms 23 state-of-the-art methods in core metrics, and ablation studies further confirm the effectiveness of each module.

1. Introduction

Salient object detection (SOD) is a fundamental task in computer vision that mimics human visual attention mechanisms to automatically identify and segment the most visually prominent regions in images [1]. In recent years, advances in deep learning have fueled substantial progress in SOD for natural scene images (NSI-SOD), with successful applications in image retrieval, object recognition, visual tracking, and various other vision-related tasks [2,3,4,5]. At the same time, SOD for optical remote sensing images (ORSI-SOD) has garnered growing interest. It delivers essential prior information for downstream tasks such as environmental change monitoring, aviation navigation, underwater detection, and urban planning [6,7,8]. Unlike natural scene images (NSI) captured by handheld cameras, optical remote sensing images (ORSI) are color imagery obtained by satellite or aerial sensors (with a wavelength range of 400–760 nm), leading to inherent distinctions between NSI and ORSI. Specifically, NSI typically exhibit fixed viewing angles, consistent object scales, and relatively simple backgrounds, which facilitate high foreground–background contrast and clear object boundaries. In contrast, ORSI often involve diverse imaging perspectives, large variations in object scale, and heterogeneous land cover compositions. These characteristics commonly lead to low-contrast edges and severe background clutter. Such differences cause mature NSI-SOD methods to perform poorly when directly applied to ORSI-SOD scenarios, highlighting an urgent need for specialized approaches that fully adapt to the unique characteristics of ORSI and ensure reliable and accurate salient object detection.
In the field of ORSI–SOD, a key challenge lies in the low contrast of target objects, often caused by uneven illumination, atmospheric effects, or long imaging distances. Although CNNs excel at local feature extraction, limited receptive fields and reliance on intensity variations make them struggle to capture sufficient discriminative cues in low contrast regions [9,10]. Transformers capture global contextual dependencies well, but patch level tokenization often blurs fine edge details, especially in low contrast scenarios [11,12].
To better recover edge details in low contrast scenes, some recent works explore frequency domain representations to enhance structural cues. However, these methods mainly rely on magnitude spectra and pay limited attention to phase information [13,14,15]. Phase Congruency (PC) measures the agreement of phase angles across multiple frequencies, which makes it naturally insensitive to intensity contrast while remaining highly sensitive to edges and textures. Motivated by this property, we propose a novel Phase Congruency Enhanced Module (PCE). It injects PC cues into a Transformer backbone to strengthen target feature representations in low contrast scenes. To further amplify the effect of enhanced phase features on key regions [16], we also design a loss weighting mechanism (pc_weight) for subsequent loss supervision. It accurately strengthens the feature expression of low–contrast targets and significantly improves detection accuracy.
Another challenge arises from severe background interference. ORSI backgrounds often contain abundant target-irrelevant components such as clouds, vegetation, water bodies, and terrain shadows. These elements exhibit visual similarity to salient objects and are therefore easily confused with true targets, causing significant interference in accurate recognition and segmentation. Inspired by the idea of "residual attention guidance" in image reconstruction and cross-attention dynamic fusion in autonomous driving [17,18,19], we design a novel Dynamic Residual Fusion (DRF) module to address complex background interference in ORSI-SOD. Its core lies in cross-scale feature integration. The module first fuses shallow and deep features, then uses dynamic spatial attention and channel attention to purify the fused features, highlighting task-relevant regions and suppressing redundant background. In addition, residual connections are used during fusion to preserve fine-grained structural details and prevent weak edges from being lost under complex interference. As a result, DRF performs both feature fusion and feature purification, enabling robust foreground segmentation under severe background interference.
The main contributions are as follows:
  • We propose a novel end-to-end network, PCFNet. It integrates frequency-domain phase enhancement with dynamic cross-scale feature refinement to reduce blurred target boundaries under low contrast and improve robustness under complex background interference.
  • We design a novel PCE module based on Fourier decomposition theory. It fuses multi-scale phase features with shallow Transformer features to resolve blurred target perception caused by low contrast and compensate for local detail loss.
  • We design a novel DRF module. It integrates dynamic spatial attention and residual connections to achieve complementary fusion and effective selection of multi-scale features, suppressing complex background interference while preserving salient structures.
  • We conduct extensive experiments on three benchmark datasets (ORSSD, EORSSD, ORSI4199). Through quantitative comparison with 23 SOTA methods, ablation experiments, and qualitative analysis of complex scenes, the effectiveness and robustness of the proposed network and core modules are fully demonstrated.

3. Proposed Method

In this section, we introduce the proposed PCFNet in detail. Section 3.1 presents an overview of the general framework of PCFNet. Section 3.2 outlines the Transformer-based feature extractor. Sections 3.3 and 3.4 describe the core components PCE and DRF in detail, respectively. Section 3.5 introduces the decoder selected for the model. Finally, Section 3.6 explains the loss function used in the model.

3.1. Framework Overview

As shown in Figure 1, the proposed PCFNet adopts an encoder-decoder architecture. The input image passes sequentially through four Swin-B blocks to extract features at four scales. At the shallowest level, the PCE module compensates for target detail loss caused by low contrast through phase congruency computation. Subsequently, deeper-level features are refined by the DRF module, which fuses and optimizes multi-scale features to suppress background interference while preserving salient structures. Finally, the hierarchical features are aggregated by the SGAED decoder to produce a high-resolution saliency map.

3.2. Swin Transformer-based Feature Extractor

Following existing studies, we select the classical Swin-B as the basic feature extractor [23,25]. Its hierarchical representation and window-based local self-attention are well suited to the dense prediction requirements of ORSI-SOD. Different from the original Swin-B architecture designed for general vision tasks, this paper performs adaptive optimization targeting the characteristics of remote sensing images. We retain the four-level feature extraction structure of the backbone network and adjust the channel dimensions of the deep layers to balance semantic representation capability and computational cost. The Swin-B in Figure 1 corresponds to the Transformer-based encoder subnetwork, which consists of four feature extraction blocks. We take the output of each block as the side-output feature map, defined as $f_t^i \in \mathbb{R}^{C_i \times H_i \times W_i}$ ($i \in \{1, 2, 3, 4\}$). For the input image, the channel numbers $C_{1,2,3,4}$ of the feature maps are $\{128, 256, 512, 1024\}$, and the resolutions $H_{1,2,3,4} = W_{1,2,3,4}$ correspond to $\{112, 56, 28, 14\}$, respectively.

3.3. Phase Congruency Enhanced Module

ORSI often suffer from edge blurring and false target detection in low-contrast scenes, which impairs the discriminability of salient targets. To address this issue, we design a Phase Congruency Enhanced (PCE) Module (as shown in Figure 2), which is deployed after the first-stage Swin-B feature extractor (operating on the feature map $f_t^1$). Through the synergy of frequency-domain phase feature analysis and spatial-domain feature enhancement, the PCE module strengthens fine-grained local features of remote sensing targets, providing high-quality inputs for subsequent cross-scale feature fusion.
1)A: The Wave Filter Module serves as the frequency-domain response extraction unit of the PCE module, responsible for capturing multi-scale and multi-directional edge information of targets. First, the first-stage feature $f_t^1$ output by Swin-B is compressed to 32 channels via a $1 \times 1$ convolution to reduce computational complexity:
$$f_{in}^1 = \mathrm{Conv}_{1\times1}(f_t^1)$$
Subsequently, $f_{in}^1$ is fed into the Wave Filter Module, where it is successively processed by Difference of Gaussians (DoG) filtering and multi-scale, multi-orientation log-Gabor filtering. The module applies the same filtering operation to all 32 channels. For each channel, the log-Gabor filter bank is constructed with 2 scales (wavelengths of 8 and 12 pixels) and 6 evenly distributed orientations ($0^\circ, 30^\circ, 60^\circ, 90^\circ, 120^\circ, 150^\circ$). By this means, a total of 12 frequency-domain response maps ($M$) are generated, which cover edge and texture features of targets at different scales and orientations and provide the basic inputs for subsequent phase analysis.
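As an illustrative sketch (not the paper's implementation), a 2-scale, 6-orientation log-Gabor bank of this kind can be built directly in the frequency domain as follows; the bandwidth parameters `sigma_f` and `sigma_theta` are assumed values, and the DoG pre-filtering step is omitted:

```python
import numpy as np

def log_gabor_bank(h, w, wavelengths=(8, 12), n_orient=6,
                   sigma_f=0.55, sigma_theta=0.4):
    """Frequency-domain log-Gabor filter bank: one radial log-Gaussian
    profile per scale times one angular Gaussian per orientation,
    giving len(wavelengths) * n_orient filters (12 for 2 x 6)."""
    fy = np.fft.fftfreq(h)[:, None]       # vertical frequency grid
    fx = np.fft.fftfreq(w)[None, :]       # horizontal frequency grid
    radius = np.sqrt(fx ** 2 + fy ** 2)
    radius[0, 0] = 1.0                    # avoid log(0) at the DC bin
    theta = np.arctan2(-fy, fx)           # orientation of each frequency

    filters = []
    for wl in wavelengths:
        f0 = 1.0 / wl                     # centre frequency of this scale
        radial = np.exp(-(np.log(radius / f0)) ** 2
                        / (2 * np.log(sigma_f) ** 2))
        radial[0, 0] = 0.0                # zero DC response
        for k in range(n_orient):
            angle = k * np.pi / n_orient  # 0, 30, ..., 150 degrees
            dtheta = np.arctan2(np.sin(theta - angle),
                                np.cos(theta - angle))
            angular = np.exp(-dtheta ** 2 / (2 * sigma_theta ** 2))
            filters.append(radial * angular)
    return np.stack(filters)

def filter_responses(channel, bank):
    """Apply every filter in the bank to one 2-D feature channel via FFT;
    returns complex responses of shape (n_filters, h, w)."""
    F = np.fft.fft2(channel)
    return np.fft.ifft2(F[None] * bank)
```

Applied channel by channel to the 32-channel $f_{in}^1$, this yields the 12 response maps per channel described above.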
2)B: The Phase Congruency Calculation module is the core functional unit of the PCE module. It builds on the visual cognition law that image features concentrate at positions where the phase superposition of Fourier harmonics is maximal [34], so this property can be exploited to localize target features accurately. After feeding $M$ into this module, we first perform scale-wise averaging: for each of the six orientations, we average the two scales with equal weights of 0.5 to obtain the refined $M'$. Here, each of the six waves corresponds to one of the 6 orientations (denoted by $\theta$, where $\theta = 1, 2, \ldots, 6$). Then, we decompose each response map of the refined six waves into amplitude components $A_{\theta,c}$ ($c$ indexes the channels, $c = 1, 2, \ldots, 32$) and phase components $P_{\theta,c}$ via the Fast Fourier Transform (FFT) for each channel. We calculate the amplitude-weighted mean phase $\bar{P}_c$ of the phase components:

$$\bar{P}_c = \frac{\sum_{\theta=1}^{6} A_{\theta,c}\, P_{\theta,c}}{\sum_{\theta=1}^{6} A_{\theta,c} + \varepsilon}$$

where $\varepsilon$ is a small constant to avoid division by zero. Then, in the "calculate" step, we compute the phase deviation and phase congruency via Eqs. (3)-(5), where Eq. (3) is dedicated to phase deviation calculation and Eqs. (4) and (5) to phase congruency evaluation.

$$\Delta\phi_{\theta,c} = \cos\left(P_{\theta,c} - \bar{P}_c\right) - \left|\sin\left(P_{\theta,c} - \bar{P}_c\right)\right|$$

$$\mathrm{PC}_c = \frac{\sum_{\theta=1}^{6} \max\left(A_{\theta,c}\left(\Delta\phi_{\theta,c} - T_c\right),\, 0\right)}{\sum_{\theta=1}^{6} A_{\theta,c} + \varepsilon}$$

$$\mathrm{PC} = \mathrm{Concat}\left(\mathrm{PC}_1, \mathrm{PC}_2, \ldots, \mathrm{PC}_{32}\right)$$

where $T_c = 0.5 \cdot \mathrm{Avg}_{3\times3}(\bar{A}_c)$ denotes the local amplitude mean threshold. Here, $\bar{A}_c$ is obtained by averaging the amplitude spectra over the orientation dimension ($\bar{A}_c = \frac{1}{6}\sum_{\theta=1}^{6} A_{\theta,c}$), $\mathrm{Avg}_{3\times3}(\cdot)$ represents $3 \times 3$ average pooling, and $\max(\cdot, 0)$ filters out invalid phase contributions. Notably, the above operations are performed independently for each of the 32 channels of $f_{in}^1$. Thus, the final dimension of $\mathrm{PC}$ is $32 \times 112 \times 112$, with a value range of $[0, 1]$; a larger value indicates a higher probability that the position belongs to a target feature.
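The per-channel computation above can be prototyped as a minimal NumPy sketch under the paper's stated settings (six orientations, threshold factor 0.5, 3x3 average pooling); the FFT decomposition is abstracted away by taking precomputed amplitude and phase arrays as input:

```python
import numpy as np

def phase_congruency(amps, phases, eps=1e-8):
    """Phase congruency for ONE channel.

    amps, phases: arrays of shape (6, H, W) holding the amplitude and
    phase of the six orientation responses. Returns a (H, W) map in [0, 1].
    """
    # amplitude-weighted mean phase
    p_bar = (amps * phases).sum(0) / (amps.sum(0) + eps)
    # phase deviation: large where phases agree with the mean phase
    dphi = np.cos(phases - p_bar) - np.abs(np.sin(phases - p_bar))
    # local amplitude mean threshold: 0.5 * 3x3 average of the
    # orientation-mean amplitude (sliding window via padding)
    a_bar = amps.mean(0)
    pad = np.pad(a_bar, 1, mode="edge")
    t = 0.5 * np.mean(
        [pad[i:i + a_bar.shape[0], j:j + a_bar.shape[1]]
         for i in range(3) for j in range(3)], axis=0)
    # rectified, amplitude-weighted congruency
    pc = np.maximum(amps * (dphi - t), 0).sum(0) / (amps.sum(0) + eps)
    return pc
```

Running this over all 32 channels and stacking the results would reproduce the $32 \times 112 \times 112$ map described above.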
3)C: The Stripe Suppression Module has the dual function of eliminating stripe artifacts and recovering the channel dimension. As shown in Figure 3, (a) and (c) are feature maps exhibiting stripe artifacts caused by phase congruency computation without the stripe suppression module, while (b) and (d) are the clean feature maps obtained after applying the module, with stripe interference clearly eliminated. After producing the clean feature maps, the module restores the feature channel dimension to be consistent with that of the input feature, laying a solid foundation for the subsequent spatial attention enhancement. The input of this module is the preliminary phase congruency map $\mathrm{PC}$, and the specific processing is as follows:
$$\widetilde{\mathrm{PC}} = \mathrm{PC} \odot \sigma\left(\mathrm{DepthwiseConv}_{7\times7}(\mathrm{PC})\right)$$
$$\mathrm{PC}' = \sigma\left(\mathrm{Conv}_{1\times1}\left(\mathrm{Clamp}\left(\mathrm{LayerNorm}(\widetilde{\mathrm{PC}})\right)\right)\right)$$
where $\sigma(\cdot)$ represents the sigmoid activation function. First, a $7 \times 7$ depthwise convolution is performed on $\mathrm{PC}$ to capture the response of stripe-artifact regions, and a suppression mask is generated by the sigmoid function. The mask is multiplied element-wise with the original $\mathrm{PC}$ to suppress the response of stripe regions. Then, layer normalization standardizes the feature distribution, and value clipping limits the dynamic range of feature values to $[-3, 3]$, completing artifact elimination and feature stabilization. Subsequently, a $1 \times 1$ convolution is applied to the artifact-removed features to restore the channel dimension from 32 to 128. Finally, the sigmoid function maps the feature values to the range $[0, 1]$, generating the phase congruency weight map $\mathrm{PC}'$.
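A minimal PyTorch sketch of this stripe-suppression pipeline follows; the normalization placement and weight initialization are assumptions on our part:

```python
import torch
import torch.nn as nn

class StripeSuppression(nn.Module):
    """Sketch: a 7x7 depthwise conv produces a sigmoid mask that damps
    stripe responses; LayerNorm plus clamping to [-3, 3] stabilises the
    features; a 1x1 conv restores the channel count (32 -> 128)."""

    def __init__(self, in_ch=32, out_ch=128):
        super().__init__()
        self.dw = nn.Conv2d(in_ch, in_ch, 7, padding=3, groups=in_ch)
        self.norm = nn.LayerNorm(in_ch)       # normalises over channels
        self.proj = nn.Conv2d(in_ch, out_ch, 1)

    def forward(self, pc):                    # pc: (B, 32, H, W)
        mask = torch.sigmoid(self.dw(pc))     # stripe-region response
        x = pc * mask                         # suppress stripe regions
        x = self.norm(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
        x = x.clamp(-3, 3)                    # limit dynamic range
        return torch.sigmoid(self.proj(x))    # weight map PC' in [0, 1]
```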
4)D: The Variance-Based Spatial Attention is the output enhancement unit of the PCE module, used to improve the spatial discriminability of features. We use variance because, in ORSI, the boundaries and textured regions of salient objects usually exhibit noticeable gray-level jumps, leading to high local variance: high variance typically reflects the contours or locations of salient objects, while low variance corresponds to smooth regions. After feeding $\mathrm{PC}'$ into this module, we first calculate the mean $\mu(\mathrm{PC}')$ and variance $\sigma(\mathrm{PC}')$ across all channels for each spatial position, concatenate them, and compress the dimension via a $1 \times 1$ convolution to generate spatial attention weights. We then perform element-wise multiplication with $f_t^1$ and add a residual connection to avoid semantic loss. Finally, the enhanced feature $f_p^1$ is output, as shown in Eq. (7):
$$f_p^1 = f_t^1 \odot \sigma\left(\mathrm{Conv}_{1\times1}\left(\left[\mu(\mathrm{PC}'),\, \sigma(\mathrm{PC}')\right]\right)\right) + f_t^1$$
The enhanced feature $f_p^1$, with significantly strengthened target information, is fed into the subsequent DRF module for cross-scale feature fusion.
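Eq. (7) can be sketched as a small PyTorch module; the concatenation order and the use of the (biased-by-default) per-pixel channel variance are assumptions:

```python
import torch
import torch.nn as nn

class VarianceSpatialAttention(nn.Module):
    """Sketch of Eq. (7): per-position mean and variance across the
    channels of PC' are concatenated, compressed to a one-channel
    attention map by a 1x1 conv, and used to re-weight the backbone
    feature f_t^1 with a residual connection."""

    def __init__(self):
        super().__init__()
        self.squeeze = nn.Conv2d(2, 1, 1)  # [mean, var] -> attention logit

    def forward(self, pc, ft1):            # both: (B, C, H, W)
        mu = pc.mean(dim=1, keepdim=True)   # per-pixel channel mean
        var = pc.var(dim=1, keepdim=True)   # per-pixel channel variance
        attn = torch.sigmoid(self.squeeze(torch.cat([mu, var], dim=1)))
        return ft1 * attn + ft1             # enhanced feature f_p^1
```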

3.4. Dynamic Residual Fusion (DRF) Module

The Dynamic Residual Fusion (DRF) Module is designed to alleviate insufficient cross-scale feature fusion and suppress complex background interference. It leverages the residual attention mechanism to gradually achieve sufficient interaction between shallow and deep features in both the channel and spatial dimensions. As shown in Figure 4, the DRF module consists of a feature concatenation unit, Residual Dual Attention Blocks (RABs), and a global residual connection. In the DRF, the feature concatenation unit fuses cross-scale features, the dual attention mechanism filters the fused cross-scale features, and the residual connections preserve detailed information and gradient stability during feature enhancement.
The input of DRF includes two cross-scale features $f_t^{i-1}$ and $f_t^i$. First, $f_t^{i-1}$ (after downsampling) and $f_t^i$ are concatenated along the channel dimension to obtain the feature $f_{cat}$. Then a $1 \times 1$ convolution followed by BN and LeakyReLU is used to compress the channel dimension, yielding the initial feature $f_{init}$.
$$f_{init} = \mathrm{LeakyReLU}\left(\mathrm{BN}\left(\mathrm{Conv}_{1\times1}(f_{cat})\right)\right)$$
Subsequently, $f_{init}$ is fed into a residual attention group composed of 5 stacked RABs. Inspired by the gradient residual attention network [17], which effectively aggregates fine-grained features via stacked residual dense blocks, we choose 5 RABs as the intra-group units, balancing feature enhancement performance, computational overhead, and gradient stability; the validity of this configuration is demonstrated in the subsequent ablation experiments. For each RAB, the input feature is first preprocessed by two $3 \times 3$ convolutions: the first layer is followed by BN and LeakyReLU, while the second layer retains only BN, resulting in the feature $f_{conv2}$.
$$f_{conv1} = \mathrm{LeakyReLU}\left(\mathrm{BN}\left(\mathrm{Conv}_{3\times3}(f_{R\text{-}in})\right)\right)$$
$$f_{conv2} = \mathrm{BN}\left(\mathrm{Conv}_{3\times3}(f_{conv1})\right)$$
Based on $f_{conv2}$, the Channel Attention (CA) module is first used to generate channel weights. Global max pooling (GMP) and global average pooling (GAP) over the spatial dimensions are performed on $f_{conv2}$ to obtain the channel descriptors $f_{gmp}$ and $f_{gap}$. These two descriptors are fed into a shared MLP for dimension reduction and restoration, then summed and activated by a sigmoid to generate the weight $w_{CA}$, which is element-wise multiplied with $f_{conv2}$.
$$f_{gmp} = \mathrm{GMP}(f_{conv2}), \quad f_{gap} = \mathrm{GAP}(f_{conv2})$$
$$w_{CA} = \sigma\left(\mathrm{MLP}(f_{gmp}) + \mathrm{MLP}(f_{gap})\right)$$
$$f_{CA} = f_{conv2} \odot w_{CA} + f_{conv2}$$
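A rough PyTorch sketch of this channel-attention step is shown below; the reduction ratio `r` of the shared MLP is an assumed value not given in the text:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Sketch: max- and average-pooled channel descriptors pass through
    a shared MLP, are summed, sigmoid-activated into per-channel
    weights, and applied with a residual connection."""

    def __init__(self, channels, r=16):
        super().__init__()
        self.mlp = nn.Sequential(            # shared by both descriptors
            nn.Linear(channels, channels // r),
            nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels),
        )

    def forward(self, x):                    # x: (B, C, H, W)
        b, c, _, _ = x.shape
        gmp = x.amax(dim=(2, 3))             # global max pooling  -> (B, C)
        gap = x.mean(dim=(2, 3))             # global average pooling
        w = torch.sigmoid(self.mlp(gmp) + self.mlp(gap)).view(b, c, 1, 1)
        return x * w + x                     # f_CA with residual
```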
Then, the Dynamic Spatial Attention (DSA) module is applied to $f_{CA}$ to achieve spatial feature enhancement. First, global average pooling is performed on the input feature $f_{CA}$, followed by two successive $1 \times 1$ convolutions with a ReLU activation in between; a $3 \times 3$ convolution kernel is then generated through a Reshape operation. The module is dynamic in that the kernel shape remains $3 \times 3$ while its weights are not fixed preset values: they are adaptively computed from the input features of each individual sample, so each sample in a batch is endowed with its own set of $3 \times 3$ kernel weights that adapt to the spatial feature pattern of that sample. This is the core embodiment of the "dynamic" property of our DSA module. Meanwhile, a channel-wise mean operation is conducted on $f_{CA}$ to obtain a single-channel spatial feature map, and the sample-specific $3 \times 3$ dynamic kernels are convolved with this map. The spatial attention weight $w_{DSA}$ is then generated by the sigmoid activation function. Finally, the original feature $f_{CA}$ is multiplied element-wise by $w_{DSA}$ to obtain the enhanced feature $f_{DSA}$.
$$k_{dyn} = \mathrm{Reshape}\left(\mathrm{Conv}_{1\times1}\left(\mathrm{ReLU}\left(\mathrm{Conv}_{1\times1}\left(\mathrm{AvgPool}(f_{CA})\right)\right)\right)\right)$$
$$w_{DSA} = \sigma\left(\mathrm{Conv}_{dyn}\left(\mathrm{mean}(f_{CA})\right)\right)$$
$$f_{DSA} = f_{CA} \odot w_{DSA}$$
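The per-sample dynamic convolution can be implemented with a grouped convolution trick, as in the sketch below; the hidden width of the kernel generator is an assumed value:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicSpatialAttention(nn.Module):
    """Sketch: a per-sample 3x3 kernel is generated from globally pooled
    features, then convolved with the channel-wise mean map to produce
    a spatial attention weight."""

    def __init__(self, channels, hidden=32):
        super().__init__()
        self.gen = nn.Sequential(             # kernel generator
            nn.Conv2d(channels, hidden, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, 9, 1),          # 9 weights -> one 3x3 kernel
        )

    def forward(self, x):                     # x: (B, C, H, W)
        b = x.size(0)
        k = self.gen(F.adaptive_avg_pool2d(x, 1))   # (B, 9, 1, 1)
        k = k.view(b, 1, 3, 3)                      # sample-specific kernels
        m = x.mean(dim=1, keepdim=True)             # (B, 1, H, W)
        # groups=b applies each sample's own kernel to its own map
        w = F.conv2d(m.view(1, b, *m.shape[2:]), k, padding=1, groups=b)
        w = torch.sigmoid(w.view(b, 1, *m.shape[2:]))
        return x * w                          # f_DSA
```

Folding the batch into the channel dimension and using `groups=b` is a standard way to run a different kernel per sample in one call.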
The output of each RAB is added to the input feature via a local residual connection.
$$f_{R\text{-}out}^i = f_{DSA} + f_{R\text{-}in}^i \quad (i = 1, 2, \ldots, 5)$$
$$f_{R\text{-}g} = \mathrm{Conv}_{3\times3}\left(\mathrm{RAB}_5\left(\mathrm{RAB}_4\left(\mathrm{RAB}_3\left(\mathrm{RAB}_2\left(\mathrm{RAB}_1(f_{init})\right)\right)\right)\right)\right)$$
$$f_{res} = f_{R\text{-}g} + f_{init}$$
$$f_c^i = \mathrm{Conv}_{1\times1}(f_{res})$$
The output of the five serially connected RABs is integrated into $f_{R\text{-}g}$ through a $3 \times 3$ convolution, then added to $f_{init}$ via a global residual connection to obtain $f_{res}$. Finally, a $1 \times 1$ convolution adjusts the channel dimension to be consistent with that of $f_t^i$, yielding the fused feature $f_c^i$ of the DRF.
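Pulling these pieces together, the DRF wiring (concatenate, compress, five stacked RABs, global residual, projection) might look as follows; this is a structural sketch only, with the dual attention inside each RAB reduced to an identity placeholder and average pooling assumed as the downsampling operation:

```python
import torch
import torch.nn as nn

class RAB(nn.Module):
    """Skeleton of one Residual Dual Attention Block: two 3x3 convs,
    an attention stage (placeholder here), and a local residual."""
    def __init__(self, c):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(c, c, 3, padding=1), nn.BatchNorm2d(c),
            nn.LeakyReLU(inplace=True),
            nn.Conv2d(c, c, 3, padding=1), nn.BatchNorm2d(c),
        )
        self.attn = nn.Identity()             # stands in for CA + DSA

    def forward(self, x):
        return self.attn(self.body(x)) + x    # local residual connection

class DRF(nn.Module):
    """Concatenate two scales, compress, refine with 5 RABs, add the
    global residual, and project back to the per-stage channel count."""
    def __init__(self, c_prev, c_cur, c_mid=256):
        super().__init__()
        self.down = nn.AvgPool2d(2)                      # align f_t^{i-1}
        self.compress = nn.Sequential(
            nn.Conv2d(c_prev + c_cur, c_mid, 1),
            nn.BatchNorm2d(c_mid), nn.LeakyReLU(inplace=True))
        self.rabs = nn.Sequential(*[RAB(c_mid) for _ in range(5)])
        self.tail = nn.Conv2d(c_mid, c_mid, 3, padding=1)
        self.proj = nn.Conv2d(c_mid, c_cur, 1)

    def forward(self, f_prev, f_cur):
        f_init = self.compress(torch.cat([self.down(f_prev), f_cur], 1))
        f_res = self.tail(self.rabs(f_init)) + f_init    # global residual
        return self.proj(f_res)                          # fused feature f_c^i
```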

3.5. Decoder

To ensure the integrity of the network architecture and achieve accurate mapping from high-level fused features to pixel-level saliency maps, this paper directly adopts the Saliency-Guided Attention Enhanced Decoder (SGAED) from the literature as the decoding module of the network [31]. This decoder adapts to the feature format output by our encoder without additional modification. Its core function is to receive the three groups of high-level fused features ($f_c^i \in \mathbb{R}^{256 \times \frac{112}{2^{i-1}} \times \frac{112}{2^{i-1}}}$, $i = 2, 3, 4$) from the DRF module and the enhanced shallow feature ($f_p^1 \in \mathbb{R}^{128 \times 112 \times 112}$) from the PCE module. Specifically, the $f_t^4$ feature and the saliency map $S_5$ ($14 \times 14$) generated from it are fed into decoder4, after which decoder4 to decoder1 gradually perform feature upsampling and detail refinement (decoder4 does not perform upsampling; $f_{de}^i \in \mathbb{R}^{C \times \frac{112}{2^{i-1}} \times \frac{112}{2^{i-1}}}$ and $S_i$ of size $\frac{112}{2^{i-1}} \times \frac{112}{2^{i-1}}$ for $i = 2, 3, 4$; $f_{de}^2 \in \mathbb{R}^{128 \times 112 \times 112}$). Decoder0 is designed to refine the $S_1$ ($112 \times 112$) features at the $112 \times 112$ resolution and generate the high-detail $S_0$ ($224 \times 224$), which supplies fine-grained supervision. Through this repeated computation, key features are strengthened and representation capability is improved, while the integrity of the five SGAED components in the original framework is preserved. Finally, all saliency maps are interpolated to $224 \times 224$ and used as the supervised outputs and final predictions.

3.6. Loss Function

The loss function constructed in this paper combines CE loss and IoU loss [35,36]. In addition, we design a novel weighting mechanism based on the phase congruency map output by the Phase Congruency Enhanced module.
1) Saliency Supervision: We adopt CE loss and IoU loss to perform basic supervision on the last four layers of intermediate saliency maps, formulated as follows:
$$L_B = \sum_{i=2}^{5} \left[ L_{CE}\left(\mathrm{Up}_{k_i}(S_i), G\right) + L_{IoU}\left(\mathrm{Up}_{k_i}(S_i), G\right) \right]$$
where $S_i$ represents the $i$-th saliency prediction map ($i = 2, 3, 4, 5$), $G$ is the ground-truth label map, and $\mathrm{Up}_{k_i}(\cdot)$ denotes the upsampling operation with an upsampling factor of $k_i$. The value of $k_i$ is defined as:
$$k_i = \begin{cases} 2^4, & \text{if } i = 5 \\ 2^i, & \text{if } i \le 4 \end{cases}$$
2) Phase Congruency Map Weighting (pc_weight): Existing loss functions fail to focus on salient structural regions in ORSI, so we design a phase congruency map weighting mechanism, referred to as pc_weight. It strengthens supervision on key regions and mitigates the impact of the weak features of low-contrast targets. Note that we apply this weighted loss only to $S_0$ and $S_1$, because the phase congruency enhancement module is embedded only in stage 1 of feature extraction and its effect is limited to the shallow feature stage.
The pc_map refers to the single-channel phase congruency map derived by averaging the 128-channel phase congruency weight map generated by Eq. (6) in Section 3.3. It reflects the distribution characteristics of salient structures such as edges and textures in the image. First, the original pc_map is interpolated to the size of the current prediction map and normalized to obtain the pixel-level weight $\omega$ (pc_weight), calculated as:
$$\omega = \frac{\mathrm{Up}_s(\mathrm{pc\_map}) - \min\left(\mathrm{Up}_s(\mathrm{pc\_map})\right)}{\max\left(\mathrm{Up}_s(\mathrm{pc\_map})\right) - \min\left(\mathrm{Up}_s(\mathrm{pc\_map})\right) + \beta}$$
where $\mathrm{Up}_s(\cdot)$ is the upsampling operation that interpolates pc_map to the size of the $i$-th prediction map, and $\beta = 10^{-8}$ avoids a zero denominator. For the first two layers of high-resolution intermediate saliency maps ($S_i$, $i = 0, 1$), this weight is introduced into the loss, formulated as:
$$L_{weighted}^{(i)} = L_{CE}\left(\mathrm{Up}_{k_i}(S_i) \odot \omega,\ G \odot \omega\right) + L_{IoU}\left(\mathrm{Up}_{k_i}(S_i) \odot \omega,\ G \odot \omega\right)$$
where $\omega$ is the normalized pixel-level phase congruency weight map (ranging in $[0, 1]$), $\mathrm{Up}_{k_i}(\cdot)$ denotes the upsampling operation with an upsampling factor of $k_i$, and $\odot$ represents element-wise multiplication.
3) Overall Loss Function: Finally, the overall loss function $L_{total}$ of the model is formulated as follows:
$$L_{total} = L_B + \sum_{i \in \{0, 1\}} L_{weighted}^{(i)}$$
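The pc_weight supervision can be sketched as follows; the soft-IoU formulation and the assumption that predictions are already upsampled to the ground-truth size are ours, not the paper's exact implementation:

```python
import torch
import torch.nn.functional as F

def iou_loss(pred, gt, eps=1e-8):
    """Soft IoU loss on sigmoid probabilities (a common formulation;
    the paper's exact variant may differ)."""
    p = torch.sigmoid(pred)
    inter = (p * gt).sum(dim=(2, 3))
    union = (p + gt - p * gt).sum(dim=(2, 3))
    return (1 - inter / (union + eps)).mean()

def pc_weighted_loss(logits, gt, pc_map, beta=1e-8):
    """Resize the phase-congruency map to the prediction size, min-max
    normalise it into omega, and multiply it into both prediction and
    ground truth before the CE and IoU terms."""
    w = F.interpolate(pc_map, size=gt.shape[2:], mode="bilinear",
                      align_corners=False)
    w = (w - w.min()) / (w.max() - w.min() + beta)   # omega in [0, 1]
    ce = F.binary_cross_entropy_with_logits(logits * w, gt * w)
    return ce + iou_loss(logits * w, gt * w)
```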

4. Experiments

4.1. Experimental Settings

1) Dataset We adopt three mainstream benchmark datasets for ORSI-SOD, including ORSSD [37], EORSSD [38] and ORSI4199 [39]. These datasets cover scenes of varying complexity to comprehensively test the model’s detection performance.
ORSSD is the first public dataset for remote sensing salient object detection, containing 800 pixel-level annotated images (600 for training, 200 for testing). It covers relatively regular scenes and verifies the model's basic performance.
EORSSD is an extension of ORSSD, with 2000 images (1400 training, 600 testing). It adds targets with complex backgrounds and irregular structures, increasing detection difficulty to test the model’s adaptability to complex scenes.
ORSI4199 is a large-scale challenging dataset, with 4199 high-precision annotated images (2000 training, 2199 testing). It includes multi-attribute complex targets (large, small, low-contrast) in realistic scenes, verifying the model’s generalization ability.
2) Implementation Details To verify the proposed network’s performance in ORSI-SOD, experiments are implemented via the PyTorch framework on a workstation with NVIDIA RTX 3090 Ti GPUs, using Python 3.8, PyTorch 1.11.0 and related libraries. During training, images are augmented with random flip, rotation and Gaussian blur, resized to 224 × 224 . The Adam optimizer (initial learning rate 7 × 10 5 , β 1 = 0.9 , β 2 = 0.999 ) is used, with learning rate adjusted via StepLR (step size=20 epochs, decay factor=0.5). Batch size is 8 and the model is trained for 100 epochs.
3) Evaluation Metrics Four mainstream metrics are adopted in the experiments to comprehensively evaluate the model performance.
S-Measure ($S_\alpha$) [40]: It measures the structural similarity between the saliency map and the ground truth, integrating object-level ($S_{oj}$) and region-level ($S_{re}$) similarity:

$$S_\alpha = \alpha \times S_{oj} + (1 - \alpha) \times S_{re}$$

where the balance weight $\alpha$ is set to 0.5 by default.
F-Measure ($F_\beta$) [20]: It is a comprehensive metric that balances precision ($Pr$) and recall ($Re$):

$$F_\beta = \frac{(1 + \beta^2) \times Pr \times Re}{\beta^2 \times Pr + Re}$$

Following the mainstream setting, $\beta^2$ is set to 0.3.
E-measure ($E_\xi$) [41]: It considers both pixel-level correspondence and image-level statistical information simultaneously:

$$E_\xi = \frac{1}{H \times W} \sum_{x=1}^{H} \sum_{y=1}^{W} \phi(x, y)$$

where $H$ and $W$ are the height and width of the saliency map, and $\phi(x, y)$ denotes the enhanced alignment function.
MAE ($M$) [42]: It calculates the average pixel-wise deviation between the saliency map and the ground truth:

$$\mathrm{MAE} = \frac{1}{H \times W} \sum_{x=1}^{H} \sum_{y=1}^{W} \left| S(x, y) - G(x, y) \right|$$

where $S(x, y)$ and $G(x, y)$ are the values of the saliency map and the ground truth at position $(x, y)$, respectively.
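For reference, MAE and a single-threshold F-measure can be computed as in the sketch below; benchmark toolkits typically sweep thresholds or use adaptive ones, so this is only a simplified illustration:

```python
import numpy as np

def mae(sal, gt):
    """Mean absolute error between a saliency map and ground truth,
    both assumed to lie in [0, 1]."""
    return np.abs(sal - gt).mean()

def f_measure(sal, gt, beta2=0.3, thresh=0.5, eps=1e-8):
    """F-measure with beta^2 = 0.3 at a single fixed threshold."""
    pred = sal >= thresh
    tp = np.logical_and(pred, gt > 0.5).sum()
    precision = tp / (pred.sum() + eps)
    recall = tp / ((gt > 0.5).sum() + eps)
    return (1 + beta2) * precision * recall / (beta2 * precision + recall + eps)
```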

4.2. Comparison with SOTA Methods

1) Comparison Methods Our proposed model (ours) and 23 state-of-the-art (SOTA) models are compared across all three benchmark datasets. The compared methods encompass a diverse range of models: RRWR [20] and RCRR [43] are two conventional NSI-SOD models. EGNet [44], MINet [45], and GatedNet [46] are three deep NSI-SOD models.
LVNet [37], DAFNet [38], MCCNet [47], CorrNet [48], ASTTNet [28], MJRBM [49], RRNet [40], EMFINet [50], ERPNet [44], ACCoNet [51], AESINet [27], DCCNet [52], LSHNet [53], and MCPNet [54] are fourteen deep ORSI-SOD models. HFANet [55], ADSTNet [56], HFCNet [31], and CMNFNet [57] are four hybrid ORSI-SOD models. Table 1 lists all the quantitative results, which were generated either by running the corresponding open-source code provided by the authors with default parameter configurations or by computing metrics from publicly accessible saliency maps.
2) Quantitative Comparisons and Discussions As shown in Table 1, our method (ours) performs prominently in key metrics across the ORSSD, EORSSD, and ORSI4199 datasets. Overall, it outperforms most state-of-the-art methods. On the ORSSD dataset, our method ranks first in $S_\alpha$, $F_\beta$, $E_\xi$, and MAE (0.9540, 0.9305, 0.9888, and 0.0071, respectively). Compared with CMNFNet, it achieves a 0.65% improvement in $S_\alpha$ and a 1.16% improvement in $F_\beta$. On the EORSSD dataset, our method ranks first in $F_\beta$ (0.8943), $E_\xi$ (0.9843), and MAE (0.0048), and second in $S_\alpha$ (slightly behind HFCNet-R), while maintaining more balanced overall performance. On the ORSI4199 dataset, our method ranks first in two metrics, with $S_\alpha$ (0.8858) and $F_\beta$ (0.8859). In summary, our method demonstrates competitiveness across all datasets, verifying its effectiveness in ORSI-SOD.
3) Qualitative Comparisons and Discussions We select representative visual examples from the ORSSD, EORSSD, and ORSI4199 datasets. These examples are grouped by scene type (as shown in Figure 5) to qualitatively compare our method with state-of-the-art models.
For the first four rows, representing low-contrast scenes, the target and background exhibit high texture similarity. In the 1st row, this similarity causes blurred edges in the saliency maps of ACCoNet and ERPNet. In rows 2-4, the subtle grayscale difference between target and background leads to missing targets or blurred adhesion in the saliency maps of CMNFNet and MJRBM. In contrast, our method enhances the discriminability of target features in low-contrast regions through the PCE module, producing saliency maps that accurately restore target contours and closely match the ground truth. For the last four rows, representing complex background interference scenes, the cluttered background easily interferes with target detection. In rows 5 and 6, with complex building scenes, models such as CMNFNet and MCCNet tend to introduce background noise spots, whereas our model separates the target from the background more clearly and remains robust to irrelevant background interference. In the 7th row, featuring a dome-shaped scene, GatedNet suffers from edge distortion due to background element interference, while our dual-module design effectively resists redundant background interference and accurately aligns with the true target contour. In the 8th row, the target is extremely small and easily confused with the background, causing methods like CorrNet and MCCNet to miss the target or falsely activate background regions; in contrast, our method detects it correctly.

4.3. Ablation Study

To verify the effectiveness of the proposed modules, core components, and loss functions, we conduct ablation experiments in this section. The ablations are designed at the module, component, and loss function levels. We further analyze the contribution of each part to model performance with visualization results.
1) Ablation study between different modules To verify the effectiveness of the proposed PCE and DRF modules, we designed four model variants on the ORSSD and EORSSD datasets: a) Base (Swin-B Encoder and SGAED Decoder), b) Base+PCE, c) Base+DRF, d) Base+DRF+PCE (ours).
The quantitative results (Table 2) demonstrate that, on the ORSSD dataset, the Base performs the worst. After integrating PCE, $S_\alpha$ is improved by 0.74%. When adding DRF, $S_\alpha$ increases by 0.64%, while $F_\beta$ and $E_\xi$ are slightly improved. Our method, which combines both modules, not only enhances target discriminability in low-contrast regions via PCE but also aligns cross-scale features and filters background interference via DRF. Compared with the Baseline, our method achieves a 0.99% improvement in $S_\alpha$ and a 1.40% improvement in $F_\beta$. Consistent trends are observed on the EORSSD dataset, where our method improves $S_\alpha$ and $F_\beta$ by 0.67% and 2.73%, respectively, compared with the Baseline, confirming the targeted value of the two modules in addressing typical issues of optical remote sensing images.
Figure 6 further demonstrates the role of our modules. In the scene of the 1st row, the Base result exhibits target edge diffusion due to background interference; adding DRF makes the predicted contours more compact and well defined, suggesting that DRF suppresses redundant background responses. In the 2nd row, with a low-contrast linear target, the Base result shows discontinuities, while schemes with PCE preserve the target shape more completely, indicating PCE’s ability to enhance low-contrast features. In the 3rd row, schemes with a single module lose details, whereas combining DRF and PCE preserves the target shape much better, showing their synergy in suppressing interference and enhancing features for more accurate segmentation.
To verify the enhancement effect of PCE on the features extracted by Swin Transformer Stage 1, we generate feature heatmaps with a Tkinter-based feature map viewer, as shown in Figure 7. The differences between the original Stage 1 heatmaps and the enhanced ones are clear, reflecting improved target details and saliency. In Row 1, the original Stage 1 features show small brightness differences between target and background and blurry contours; after enhancement, the target region becomes brighter and more distinguishable. In Row 2, the original features contain only sparse dot-like highlights; after enhancement, the target region becomes more uniform, and its shape is more complete and better separated from the background. In Row 3, the original edge responses are discontinuous; after enhancement, continuous highlight bands appear along the target edges. These observations indicate that PCE improves target–background separability in low-contrast scenes and strengthens edge integrity, providing a stronger detail basis for subsequent cross-scale fusion.
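Heatmaps like those in Figure 7 are typically produced by channel-averaging a feature tensor and min-max normalizing it. The sketch below is our illustration of that common recipe, not the authors' viewer code; the Stage 1 shape is a hypothetical example.

```python
import numpy as np

def feature_heatmap(feat: np.ndarray) -> np.ndarray:
    """Collapse a (C, H, W) feature tensor into a normalized (H, W) heatmap.

    Channel-averaging followed by min-max scaling is a standard way to
    visualize backbone activations; the paper's Tkinter-based viewer is not
    public, so this is only an illustrative sketch of the usual recipe.
    """
    hm = feat.mean(axis=0)                 # average over channels -> (H, W)
    hm = hm - hm.min()                     # shift minimum to 0
    span = hm.max()
    return hm / span if span > 0 else hm   # scale into [0, 1]

# Hypothetical Stage 1 feature shape (channels, height, width).
stage1 = np.random.default_rng(0).standard_normal((96, 56, 56))
hm = feature_heatmap(stage1)
print(hm.shape, float(hm.min()), float(hm.max()))  # (56, 56) 0.0 1.0
```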
2) Ablation study within the modules: To verify the effectiveness of the core components in the PCE and DRF modules, we design ablation variants with key components removed and analyze their performance on the ORSSD and EORSSD datasets.
For the PCE module, we design two variants: w/o PC calculation (the Wave Filter, Phase Congruency Calculation, and Stripe Suppression components are removed, retaining only the VSA) and w/o VSA (PCE without the VSA). As shown in Table 3(a), removing either part leads to performance degradation. As shown in Figure 8(a), in the 1st row, removing these parts results in the loss of texture details inside the target; in the 2nd row, “ours” accurately reproduces the target shape, while “w/o VSA” suffers from detail loss and background interference. This indicates that the phase congruency calculation enhances target edges and textures while suppressing background noise, and the VSA focuses on the target region to strengthen effective feature responses. The two parts work synergistically, serving as the core of the PCE module for low-contrast feature enhancement, and both are indispensable for the performance improvement.
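To make the role of the VSA branch concrete, here is a minimal sketch of a variance-based spatial attention gate. The exact formulation is not reproduced in this excerpt, so the standardization and sigmoid gating below are our assumptions, kept deliberately simple.

```python
import numpy as np

def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-x))

def variance_spatial_attention(feat: np.ndarray) -> np.ndarray:
    """Minimal sketch of a variance-based spatial attention (VSA) gate.

    Assumption (the paper's exact formulation is not shown here): the
    per-pixel variance across channels is standardized and squashed into a
    (0, 1) gate that reweights the features, so that high-variance,
    detail-rich locations are emphasized.
    """
    var = feat.var(axis=0)                          # (H, W) channel variance
    var = (var - var.mean()) / (var.std() + 1e-6)   # standardize
    gate = sigmoid(var)                             # gate in (0, 1)
    return feat * gate[None, :, :]                  # broadcast over channels

x = np.random.default_rng(1).standard_normal((8, 16, 16))
y = variance_spatial_attention(x)
print(y.shape)  # (8, 16, 16)
```

Since the gate lies strictly in (0, 1), the output never amplifies a location, only attenuates low-variance ones relative to high-variance ones.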
For the DRF module, we design two variants: w/o CA (DRF without channel attention) and w/o DSA (DRF without dynamic spatial attention). The quantitative results (Table 3(b)) indicate that performance decreases after removing either CA or DSA. For the dome-shaped building targets in Figure 8(b), the saliency maps of w/o CA contain background clutter, and the target contours of w/o DSA are distorted; a similar pattern appears in the river scene. The complete DRF module generates saliency maps with clear and intact shapes, verifying that the synergy between CA and DSA effectively resolves the background-interference challenge.
We select 5 stacked RABs for the DRF module, the optimal choice according to the ablation experiments. Table 4 shows that increasing the number of RABs from 3 to 5 improves all three core metrics ( S α , F β , E ξ ) on the ORSSD dataset, suggesting that more RABs strengthen feature fusion and redundancy filtering. When the number of RABs exceeds 5, parameter redundancy causes gradient attenuation, and all metrics decline. Therefore, a stack of 5 RABs enables the model to achieve the best performance.
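The serial RAB stack can be sketched as follows. The internals of each block (channel gate, spatial gate, residual connection) are our simplified assumptions standing in for the learned layers; only the data flow is illustrative.

```python
import numpy as np

def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-x))

def rab(feat: np.ndarray) -> np.ndarray:
    """One Residual Attention Block (RAB), heavily simplified.

    Assumed data flow (the learned layers are omitted): a channel gate from
    global average pooling, a spatial gate from the channel mean, and a
    residual connection adding the reweighted features back to the input.
    """
    ca = sigmoid(feat.mean(axis=(1, 2)))   # (C,) channel attention gate
    out = feat * ca[:, None, None]
    sa = sigmoid(out.mean(axis=0))         # (H, W) spatial attention gate
    out = out * sa[None, :, :]
    return feat + out                      # residual connection

def drf_stack(feat: np.ndarray, n_blocks: int = 5) -> np.ndarray:
    """Serially stack n_blocks RABs, as in the DRF module (5 is optimal)."""
    for _ in range(n_blocks):
        feat = rab(feat)
    return feat

x = np.random.default_rng(2).standard_normal((8, 16, 16))
y = drf_stack(x, n_blocks=5)
print(y.shape)  # (8, 16, 16)
```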
3) The impact of pc_weight on the loss function: This section verifies the effectiveness of pc_weight. In the original design, saliency maps S 0 and S 1 use a pc_weight-weighted combination of CE loss and IoU loss, while S 2 – S 5 adopt ordinary CE and IoU losses. Removing pc_weight in this ablation means that S 0 and S 1 also use only the ordinary CE and IoU losses (consistent with S 2 – S 5 ). The results are shown in Table 5 and Figure 8(c). Table 5 shows that after removing pc_weight, with only the basic CE and IoU losses remaining, S α on ORSSD decreases from 0.9540 to 0.9484 and F β drops from 0.9305 to 0.9244, indicating that pc_weight effectively improves model performance. The visualization of this ablation is presented in Figure 8(c). For the low-contrast building targets, the prediction map of w/o pc_weight loses the prominent structures of the buildings, demonstrating that pc_weight strengthens the detail supervision of low-contrast targets. For the bridge targets, the prediction map of w/o pc_weight exhibits false activations, showing that pc_weight constrains the structural integrity of the targets.
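A pc_weight-style supervision can be sketched as a per-pixel weighted cross-entropy plus a soft IoU term. The exact weighting formula is not reproduced in this excerpt; the `1 + pc_map` weighting below is our assumption for illustration.

```python
import numpy as np

def pc_weighted_loss(pred: np.ndarray, gt: np.ndarray,
                     pc_map: np.ndarray, eps: float = 1e-6) -> float:
    """Sketch of a pc_weight-style loss: pixel-weighted CE plus soft IoU.

    Assumption (the paper's exact formula is not shown here): a
    phase-congruency map `pc_map` with values in [0, 1] upweights the
    cross-entropy of edge/texture pixels, while a soft IoU term constrains
    the overall structure of the prediction.
    """
    pred = np.clip(pred, eps, 1.0 - eps)
    w = 1.0 + pc_map                                   # weights in [1, 2]
    ce = -(gt * np.log(pred) + (1.0 - gt) * np.log(1.0 - pred))
    weighted_ce = (w * ce).sum() / w.sum()
    inter = (pred * gt).sum()
    union = (pred + gt - pred * gt).sum()
    iou_loss = 1.0 - (inter + eps) / (union + eps)     # soft IoU loss
    return float(weighted_ce + iou_loss)

gt = np.zeros((8, 8)); gt[2:6, 2:6] = 1.0              # square target
good = np.where(gt == 1.0, 0.9, 0.1)                   # confident prediction
flat = np.full((8, 8), 0.5)                            # uninformative one
pc = np.zeros((8, 8))                                  # no edge emphasis here
print(pc_weighted_loss(good, gt, pc) < pc_weighted_loss(flat, gt, pc))  # True
```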
Table 5. The impact of our pc_weight on the loss function.
No. Baseline pc_weight ORSSD ( S α / F β / E ξ ) EORSSD ( S α / F β / E ξ )
1 ✓ – 0.9484 0.9244 0.9844 0.9323 0.8854 0.9641
2 ✓ ✓ 0.9540 0.9305 0.9888 0.9393 0.8943 0.9843

4.4. Computational Complexity Analysis

1) Complexity Changes from Module Superposition: The model uses TF and SGAED as the Baseline. Adding the PCE module increases FLOPs by 6.67 G and Params by 0.53 M; PCE introduces only lightweight frequency-domain filtering operations, so its complexity overhead is minimal. Adding the DRF module increases FLOPs by 21 G and Params by 13 M; the stacked RAB attention blocks in DRF are the main source of complexity, but they ensure the effectiveness of cross-scale fusion.
2) Complexity Comparison with Existing Methods: Our proposed model has 126.94 G FLOPs and 117.29 M parameters. Compared with high-performance methods such as EMFINet (176.87 G) and ACCoNet (184.50 G), it achieves better performance while reducing FLOPs by around 30%. Compared with HFCNet, it achieves better performance with fewer parameters and comparable FLOPs. Relative to models with lower complexity (e.g., AESINet), our model is slightly more complex but performs significantly better (see Section 4.2 for details).
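The "around 30%" reduction can be checked directly from the Table 6(b) values:

```python
# Check the claimed ~30% FLOPs reduction from the Table 6(b) values (GFLOPs).
ours, emfinet, acconet = 126.94, 176.87, 184.50

def reduction_pct(ours_flops: float, other_flops: float) -> float:
    """Relative FLOPs reduction of our model versus another, in percent."""
    return round((other_flops - ours_flops) / other_flops * 100, 1)

print(reduction_pct(ours, emfinet))  # 28.2
print(reduction_pct(ours, acconet))  # 31.2
```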
Table 6. Model Computational Complexity Comparison: (a) complexity analysis of the proposed modules, (b) comparison of complexity with some state-of-the-art methods.
(a)
Models FLOPs Params
TF 71.44 G 86.64 M
TF+SGAED 99.27 G 103.76 M
TF+SGAED+PCE 105.94 G (↑6.67) 104.29 M (↑0.53)
TF+SGAED+PCE+DRF(Ours) 126.94 G (↑21) 117.29 M (↑13)
(b)
Models FLOPs Params
MCCNet 117.15 G 67.65 M
EMFINet 176.87 G 95.09 M
ERPNet 131.63 G 77.19 M
ACCoNet 184.50 G 102.55 M
AESINet 53.42 G 41.05 M
ASTTNet 43.12 G 23.35 M
ADSTNet 62.09 G 27.72 M
HFCNet 120.41 G 140.75 M
ours 126.94 G 117.29 M

5. Conclusion

In this paper, we propose PCFNet, a novel network for tackling challenging target detection in ORSI-SOD tasks. We embed the PCE and DRF modules into a Swin Transformer backbone to build a unified pipeline that combines frequency-domain detail enhancement, cross-scale semantic fusion, and refined supervision. Specifically, PCE extracts frequency-domain phase features, breaking the limitation of traditional spatial enhancement that relies on brightness differences, and improves the detail representation of targets in low-contrast scenes. The residual attention block (RAB) designed in DRF fuses channel and dynamic spatial attention to resolve complex background interference. Extensive experiments on three datasets validate the superior performance of PCFNet.

Data Availability Statement

The data will be made publicly available upon acceptance of the paper.

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work presented in this manuscript.

References

  1. Borji, A.; Cheng, M.; Jiang, H.; Li, J. Salient Object Detection: A Benchmark. IEEE Transactions on Image Processing 2015, vol. 24, no. 12, 5706–5722.
  2. Wang, W.; Lai, Q.; Fu, H.; Shen, J.; Ling, H.; Yang, R. Salient Object Detection in the Deep Learning Era: An In-Depth Survey. IEEE Transactions on Pattern Analysis and Machine Intelligence 2022, vol. 44, no. 6, 3239–3259.
  3. Zhang, P.; Zhuo, T.; Huang, W.; Chen, K.; Kankanhalli, M. Online object tracking based on CNN with spatial–temporal saliency guided sampling. Neurocomputing 2017, vol. 257, 115–127.
  4. Gao, L.; Liu, B.; Fu, P.; Xu, M.; Li, J. Visual Tracking via Dynamic Saliency Discriminative Correlation Filter. Applied Intelligence 2022, vol. 52, no. 6, 5897–5911.
  5. Song, X.; Lin, H.; Wen, H.; Hou, B.; Xu, M.; Nie, L. A Comprehensive Survey on Composed Image Retrieval. ACM Transactions on Information Systems 2025, vol. 44, no. 1, art. no. 19, 1–54.
  6. Li, C.; Guo, C.; Ren, W.; Cong, R.; Hou, J.; Kwong, S. An Underwater Image Enhancement Benchmark Dataset and Beyond. IEEE Transactions on Image Processing 2020, vol. 29, 4376–4389.
  7. Yang, L.; Wu, J.; Li, H.; Liu, C.; Wei, S. Real-Time Runway Detection Using Dual-Modal Fusion of Visible and Infrared Data. Remote Sensing 2025, vol. 17, no. 4, 669.
  8. Lei, J.; Wang, H.; Lei, Z.; Li, J.; Rong, S. CNN–Transformer Hybrid Architecture for Underwater Sonar Image Segmentation. Remote Sensing 2025, vol. 17, no. 4, 707.
  9. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016; pp. 770–778.
  10. Zhou, B.; Khosla, A.; Lapedriza, A.; Oliva, A.; Torralba, A. Learning Deep Features for Discriminative Localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016; pp. 2921–2929.
  11. Chen, J.; Zhang, H.; Gong, M.; Gao, Z. Collaborative Compensative Transformer Network for Salient Object Detection. Pattern Recognition 2024, vol. 154, art. no. 110600.
  12. Azad, R.; Kazerouni, A.; Azad, B.; Khodapanah Aghdam, E.; Velichko, Y.; Bagci, U.; Merhof, D. Laplacian-Former: Overcoming the Limitations of Vision Transformers in Local Texture Detection. In Medical Image Computing and Computer Assisted Intervention (MICCAI), LNCS 2023, vol. 14222, 736–746.
  13. Wang, X.; Wan, L.; Lin, D.; Feng, W. Phase-based fine-grained change detection. Expert Systems with Applications 2023, vol. 227, art. no. 120181.
  14. Perazzi, F.; Krähenbühl, P.; Pritch, Y.; Hornung, A. Saliency filters: Contrast based filtering for salient region detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012; pp. 733–740.
  15. Zhang, Q.; Wang, S.; Wang, X.; Sun, Z.; Kwong, S.; Jiang, J. Geometry Auxiliary Salient Object Detection for Light Fields via Graph Neural Networks. IEEE Transactions on Image Processing 2021, vol. 30, 7578–7592.
  16. Xu, M.; Sun, Z.; Hu, Y.; Tang, H.; Hu, Y.; Song, X.; Nie, L. Superpixel Segmentation With Edge Guided Local-Global Attention Network. IEEE Transactions on Circuits and Systems for Video Technology 2025, vol. 35, no. 12, 11922–11934.
  17. Yuan, X.; Zhang, B.; Zhou, J.; Lian, C.; Zhang, Q.; Yue, J. Gradient residual attention network for infrared image super-resolution. Optics and Lasers in Engineering 2024, vol. 175, art. no. 107998.
  18. Wang, F.; Jiang, M.; Qian, C.; Yang, S. Residual Attention Network for Image Classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017; pp. 6450–6458.
  19. Bi, J.; Wei, H.; Zhang, G.; Yang, K.; Song, Z. DyFusion: Cross-Attention 3D Object Detection with Dynamic Fusion. IEEE Latin America Transactions 2024, vol. 22, no. 2, 106–112.
  20. Achanta, R.; Hemami, S.; Estrada, F.; Susstrunk, S. Frequency-tuned salient region detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009; pp. 1597–1604.
  21. Luo, W.; Li, Y.; Urtasun, R.; Zemel, R. S. Understanding the Effective Receptive Field in Deep Convolutional Neural Networks. In Advances in Neural Information Processing Systems (NIPS), 2016.
  22. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-End Object Detection with Transformers. In Computer Vision – ECCV 2020; Springer: Cham, 2020; vol. 12346.
  23. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2021; pp. 9992–10002.
  24. Xie, C.; Xia, C.; Ma, M.; Zhao, Z.; Chen, X.; Li, J. Pyramid Grafting Network for One-Stage High Resolution Saliency Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022; pp. 11707–11716.
  25. Li, K.; Wan, G.; Cheng, G.; Meng, L.; Han, J. Object detection in optical remote sensing images: A survey and a new benchmark. ISPRS Journal of Photogrammetry and Remote Sensing 2020, vol. 159, 296–307.
  26. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Medical Image Computing and Computer-Assisted Intervention (MICCAI), 2015; LNCS vol. 9351, pp. 234–241.
  27. Zeng, X.; Xu, M.; Hu, Y.; Tang, H.; Hu, Y.; Nie, L. Adaptive Edge-Aware Semantic Interaction Network for Salient Object Detection in Optical Remote Sensing Images. IEEE Transactions on Geoscience and Remote Sensing 2023, vol. 61, 1–16.
  28. Gao, L.; Liu, B.; Fu, P.; Xu, M. Adaptive Spatial Tokenization Transformer for Salient Object Detection in Optical Remote Sensing Images. IEEE Transactions on Geoscience and Remote Sensing 2023, vol. 61, 1–15.
  29. Cheng, B.; Liu, Z.; Tang, H.; Wang, Q. Multimodal-Guided Transformer Architecture for Remote Sensing Salient Object Detection. IEEE Transactions on Geoscience and Remote Sensing 2025, vol. 22, 1–5.
  30. Li, J.; Li, C.; Zheng, X.; Liu, X.; Tang, C. Global Context Relation-Guided Feature Aggregation Network for Salient Object Detection in Optical Remote Sensing Images. Remote Sensing 2024, vol. 16, no. 16, 2978.
  31. Liu, Y.; Xu, M.; Xiao, T.; Tang, H.; Hu, Y.; Nie, L. Heterogeneous Feature Collaboration Network for Salient Object Detection in Optical Remote Sensing Images. IEEE Transactions on Geoscience and Remote Sensing 2024, vol. 62, 1–14.
  32. Gao, F.; Fu, M.; Cao, J.; Dong, J.; Du, Q. Adaptive Frequency Enhancement Network for Remote Sensing Image Semantic Segmentation. IEEE Transactions on Geoscience and Remote Sensing 2025, vol. 63, 1–15.
  33. Xu, M.; Yu, C.; Li, Z.; Tang, H.; Hu, Y.; Nie, L. HDNet: A Hybrid Domain Network With Multiscale High-Frequency Information Enhancement for Infrared Small-Target Detection. IEEE Transactions on Geoscience and Remote Sensing 2025, vol. 63, 1–15.
  34. Xiao, P.; Feng, X.; Zhao, S.; She, J. Segmentation of High-resolution Remotely Sensed Imagery Based on Phase Congruency. Acta Geodaetica et Cartographica Sinica 2007, vol. 36, no. 2, 146–151.
  35. Zhang, Z.; Sabuncu, M. R. Generalized cross entropy loss for training deep neural networks with noisy labels. In Proceedings of the 32nd International Conference on Neural Information Processing Systems (NeurIPS), 2018; pp. 8792–8802.
  36. Yu, J.; Jiang, Y.; Wang, Z.; Cao, Z.; Huang, T. UnitBox: An advanced object detection network. In Proceedings of the ACM International Conference on Multimedia, 2016; pp. 516–520.
  37. Li, C.; Cong, R.; Hou, J.; Zhang, S.; Qian, Y.; Kwong, S. Nested Network With Two-Stream Pyramid for Salient Object Detection in Optical Remote Sensing Images. IEEE Transactions on Geoscience and Remote Sensing 2019, vol. 57, no. 11, 9156–9166.
  38. Zhang, Q.; Cong, R.; Li, C.; Cheng, M.; Fang, Y.; Cao, X.; Zhao, Y.; Kwong, S. Dense Attention Fluid Network for Salient Object Detection in Optical Remote Sensing Images. IEEE Transactions on Image Processing 2021, vol. 30, 1305–1317.
  39. Tu, Z.; Wang, C.; Li, C.; Fan, M.; Zhao, H.; Luo, B. ORSI Salient Object Detection via Multiscale Joint Region and Boundary Model. IEEE Transactions on Geoscience and Remote Sensing 2022, vol. 60, 1–13.
  40. Fan, D.-P.; Cheng, M.-M.; Liu, Y.; Li, T.; Borji, A. Structure-Measure: A New Way to Evaluate Foreground Maps. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017; pp. 4558–4567.
  41. Fan, D.-P.; Gong, C.; Cao, Y.; Ren, B.; Cheng, M.-M.; Borji, A. Enhanced-alignment measure for binary foreground map evaluation. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), 2018; pp. 698–704.
  42. Yu, J.-G.; Zhao, J.; Tian, J.; Tan, Y. Maximal entropy random walk for region-based visual saliency. IEEE Transactions on Cybernetics 2014, vol. 44, no. 9, 1661–1672.
  43. Yuan, Y.; Li, C.; Kim, J.; Cai, W.; Feng, D. D. Reversion correction and regularized random walk ranking for saliency detection. IEEE Transactions on Image Processing 2018, vol. 27, no. 3, 1311–1322.
  44. Zhou, X. Edge-guided recurrent positioning network for salient object detection in optical remote sensing images. IEEE Transactions on Cybernetics 2023, vol. 53, no. 1, 539–552.
  45. Pang, Y.; Zhao, X.; Zhang, L.; Lu, H. Multi-scale interactive network for salient object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020; pp. 9413–9422.
  46. Zhao, X.; Pang, Y.; Zhang, L.; Lu, H.; Zhang, L. Suppress and balance: A simple gated network for salient object detection. In Proceedings of the European Conference on Computer Vision (ECCV), 2020; pp. 35–51.
  47. Li, G.; Liu, Z.; Lin, W.; Ling, H. Multi-content complementation network for salient object detection in optical remote sensing images. IEEE Transactions on Geoscience and Remote Sensing 2022, vol. 60, art. no. 5614513.
  48. Li, G.; Liu, Z.; Bai, Z.; Lin, W.; Ling, H. Lightweight salient object detection in optical remote sensing images via feature correlation. IEEE Transactions on Geoscience and Remote Sensing 2022, vol. 60, 1–12.
  49. Tu, Z.; Wang, C.; Li, C.; Fan, M.; Zhao, H.; Luo, B. ORSI salient object detection via multiscale joint region and boundary model. IEEE Transactions on Geoscience and Remote Sensing 2022, vol. 60, art. no. 5607913.
  50. Zhou, X.; Shen, K.; Liu, Z.; Gong, C.; Zhang, J.; Yan, C. Edge-aware multiscale feature integration network for salient object detection in optical remote sensing images. IEEE Transactions on Geoscience and Remote Sensing 2022, vol. 60, 1–15.
  51. Li, G.; Liu, Z.; Zeng, D.; Lin, W.; Ling, H. Adjacent context coordination network for salient object detection in optical remote sensing images. IEEE Transactions on Cybernetics 2023, vol. 53, no. 1, 526–538.
  52. Huang, J.; Huang, K. Dynamic Context Coordination for Salient Object Detection in Optical Remote Sensing Images. IEEE Transactions on Geoscience and Remote Sensing 2025, vol. 22, 1–5.
  53. Lee, S.; Cho, S.; Park, C.; Park, S.; Kim, J.; Lee, S. LSHNet: Leveraging Structure-Prior With Hierarchical Features Updates for Salient Object Detection in Optical Remote Sensing Images. IEEE Transactions on Geoscience and Remote Sensing 2024, vol. 62, 1–16.
  54. Huang, K.; Li, N.; Huang, J.; Tian, C. Exploiting Memory-Based Cross-Image Contexts for Salient Object Detection in Optical Remote Sensing Images. IEEE Transactions on Geoscience and Remote Sensing 2024, vol. 62, 1–15.
  55. Wang, Q.; Liu, Y.; Xiong, Z.; Yuan, Y. Hybrid feature aligned network for salient object detection in optical remote sensing imagery. IEEE Transactions on Geoscience and Remote Sensing 2022, vol. 60, art. no. 5624915.
  56. Zhao, J.; Jia, Y.; Ma, L.; Yu, L. Adaptive dual-stream sparse transformer network for salient object detection in optical remote sensing images. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 2024, vol. 17, 5173–5192.
  57. Xu, M.; Wang, S.; Hu, Y.; Tang, H.; Cong, R.; Nie, L. Cross-Model Nested Fusion Network for Salient Object Detection in Optical Remote Sensing Images. IEEE Transactions on Cybernetics 2025, vol. 55, no. 11, 5332–5345.
Figure 1. Overall architecture of the proposed network. The network consists of four components: the Swin-B feature extractor, the PCE module, the DRF module, and the SGAED decoder.
Figure 2. Internal structure of PCE, consisting of the Wave Filter Module (A), Phase Congruency Calculation (B), Stripe Suppression Module (C), and Variance-Based Spatial Attention (D).
Figure 3. (a) and (c) are without the Stripe Suppression Module; (b) and (d) are with it. (a)(b) form the first example pair and (c)(d) the second.
Figure 4. Internal structure of DRF. Five RABs are connected in series.
Figure 5. Qualitative comparisons of our method with seven representative SOTA methods in eight challenging scenarios. Rows 1, 2, 3, 4: low-contrast scenes, Rows 5, 6, 7, 8: complex background interference scenes.
Figure 6. Ablation study between different modules.
Figure 7. Heatmap of the enhancement effect of the PCE module on the Stage 1 features of the Swin Transformer.
Figure 8. Visualization of ablation study results within modules: (a) ablation experiments inside the PCE module, (b) ablation experiments inside the DRF module, and (c) ablation experiments on the loss function.
Table 1. Quantitative comparison of our method with 23 methods proposed by other researchers on the ORSSD, EORSSD and ORSI4199 datasets. The symbol “↑” indicates that a higher value is better for the metric, while “↓” indicates that a lower value is better. The top three results are highlighted in red, blue, and green, respectively. Results of some models may be unavailable for partial datasets, which are indicated as “-”.
Method Publication Type ORSSD ( S α ↑ / F β ↑ / E ξ ↑ / MAE↓) EORSSD ( S α ↑ / F β ↑ / E ξ ↑ / MAE↓) ORSI4199 ( S α ↑ / F β ↑ / E ξ ↑ / MAE↓)
RRWR 2015 CVPR T-NSI 0.6835 0.5590 0.7649 0.1324 0.5992 0.3993 0.6894 0.1677 0.6416 0.5407 0.7116 0.1717
RCRR 2018 TIP T-NSI 0.6849 0.5591 0.7651 0.1277 0.6007 0.3995 0.6882 0.1644 0.6491 0.548 0.7192 0.1637
ASTTNet 2023 TGRS T-ORSI 0.9347 0.9060 0.9794 0.0094 0.9253 0.8741 0.9580 0.006 0.8827 0.8788 0.9512 0.0273
EGNet 2019 ICCV C-NSI 0.8721 0.8332 0.9731 0.0216 0.8601 0.7880 0.9570 0.0110 0.8516 0.8371 0.9241 0.0385
MINet 2020 CVPR C-NSI 0.9040 0.8761 0.9545 0.0144 0.9040 0.8344 0.9442 0.0093 0.8116 0.7988 0.8961 0.0504
GatedNet 2020 ECCV C-NSI 0.9186 0.8871 0.9664 0.0137 0.9114 0.8566 0.9610 0.0095 0.8545 0.8450 0.9256 0.0393
LVNet-V 2019 TGRS C-ORSI 0.8815 0.8263 0.9456 0.0207 0.8630 0.7794 0.9254 0.0146 - - - -
DAFNet-V 2021 TIP C-ORSI 0.9191 0.8928 0.9771 0.0113 0.9166 0.8614 0.9861 0.0060 0.8492 0.8348 0.9181 0.0422
MCCNet-V 2021 TGRS C-ORSI 0.9437 0.9155 0.9800 0.0087 0.9327 0.8904 0.9755 0.0066 - - - -
CorrNet-V 2022 TGRS C-ORSI 0.9380 0.9129 0.9790 0.0098 0.9289 0.8778 0.9696 0.0083 0.8626 0.8560 0.9333 0.0366
MJRBM-R 2022 TGRS C-ORSI 0.9211 0.8885 0.9686 0.0145 0.9091 0.8555 0.9655 0.0099 0.8582 0.8511 0.9343 0.0372
RRNet-R 2022 TGRS C-ORSI 0.9339 0.9011 0.9722 0.0113 0.9266 0.8743 0.9665 0.0082 0.8585 0.8500 0.9286 0.0367
EMFINet-R 2022 TGRS C-ORSI 0.9432 0.9155 0.9813 0.0095 0.9319 0.8742 0.9712 0.0075 0.8712 0.8636 0.9403 0.0313
ERPNet-R 2023 TCYB C-ORSI 0.9352 0.9036 0.9738 0.0114 0.9252 0.8743 0.9665 0.0082 0.8636 0.8528 0.9292 0.0388
ACCoNet-R 2023 TCYB C-ORSI 0.9428 0.9149 0.9819 0.0087 0.9302 0.8821 0.9759 0.0067 0.8805 0.8688 0.9424 0.032
AESINet-R 2023 TGRS C-ORSI 0.9455 0.9160 0.9814 0.0085 0.9347 0.8792 0.9757 0.0064 0.8755 0.8726 0.9459 0.0305
DCCNet 2024 LGRS C-ORSI 0.9417 0.9168 0.9805 0.0092 0.9345 0.8887 0.9761 0.0067 0.8705 0.8619 0.9348 0.0347
LSHNet 2024 TGRS C-ORSI 0.9491 0.9200 0.9824 0.0075 0.9370 0.8643 0.9761 0.0064 0.8759 0.8758 0.9462 0.0299
MCPNet 2024 TGRS C-ORSI 0.9433 0.9135 0.9807 0.0090 0.9373 0.8868 0.9765 0.0070 0.8736 0.8667 0.9402 0.0324
HFANet-R 2022 TGRS H-ORSI 0.9399 0.9117 0.9770 0.0092 0.9380 0.8876 0.9740 0.0071 0.8767 0.8700 0.9431 0.0314
ADSTNet-R 2024 JSTARS H-ORSI 0.9379 0.9124 0.9807 0.0086 0.9311 0.8804 0.9769 0.0065 0.8710 0.8698 0.9433 0.0318
HFCNet-R 2024 TGRS H-ORSI 0.9521 0.9247 0.9885 0.0073 0.9407 0.8864 0.9793 0.0054 0.8838 0.8833 0.9539 0.0277
CMNFNet 2025 TCYB H-ORSI 0.9475 0.9189 0.9832 0.0078 0.9377 0.8851 0.9774 0.0063 0.8774 0.8752 0.9885 0.0301
ours - T-ORSI 0.9540 0.9305 0.9888 0.0071 0.9393 0.8943 0.9843 0.0048 0.8858 0.8859 0.9531 0.0279
Table 2. Ablation study between different modules on ORSSD and EORSSD.
No. Base PCE DRF ORSSD ( S α / F β / E ξ ) EORSSD ( S α / F β / E ξ )
1 ✓ – – 0.9441 0.9165 0.9666 0.9326 0.8670 0.9589
2 ✓ ✓ – 0.9511 0.9215 0.9686 0.9361 0.8706 0.9582
3 ✓ – ✓ 0.9505 0.9267 0.9868 0.9354 0.8901 0.9806
4 ✓ ✓ ✓ 0.9540 0.9305 0.9888 0.9393 0.8943 0.9843
Table 3. Ablation study within the modules on ORSSD and EORSSD datasets.
Model variants ORSSD ( S α / F β / E ξ ) EORSSD ( S α / F β / E ξ )
(a) Ablation study in PCE
ours 0.9540 0.9305 0.9888 0.9393 0.8943 0.9843
w/o PC calculation 0.9491 0.9239 0.9834 0.9328 0.8875 0.9788
w/o VSA 0.9491 0.9246 0.9840 0.9381 0.8917 0.9819
(b) Ablation study in DRF
ours 0.9540 0.9305 0.9888 0.9393 0.8943 0.9843
w/o CA 0.9489 0.9294 0.9806 0.9330 0.8866 0.9797
w/o DSA 0.9509 0.9262 0.9874 0.9364 0.8892 0.9783
Table 4. Ablation study on the number of RABs (ORSSD dataset).
Model variants S α F β E ξ
3 RAB 0.9457 0.9196 0.9824
4 RAB 0.9488 0.9219 0.9828
5 RAB 0.9540 0.9305 0.9888
6 RAB 0.9509 0.9276 0.9869
7 RAB 0.9531 0.9274 0.9863
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.