5.1. Datasets for UAV-Based Ecological Monitoring
Our experimental framework leverages four remote sensing datasets explicitly curated for UAV-based ecological monitoring applications: the Remote Sensing UAV-based Dataset for Qinghai Ecosystem (RSUAV-QH), RSSCN7[
45], UC Merced Land Use Dataset (UCM)[
46], and WHU-RS19[
47]. These collections provide UAV-compatible imagery captured under diverse environmental conditions, enabling robust super-resolution model development tailored to precision ecological assessment. The geographic and thematic diversity of these datasets is visually summarized in
Figure 10, highlighting landscapes critical for UAV ecological surveys including wetlands, grasslands, forests, and coastal ecosystems.
The RSUAV-QH dataset centers on UAV monitoring of the ecologically critical Sanjiangyuan Area (Source of Three Rivers) in Qinghai Province, China (E100.6∘, N36.1∘). Collected entirely via UAV platforms, this dataset captures high-resolution imagery essential for tracking grassland degradation, wetland health dynamics, and water resource changes—ecosystem processes requiring frequent multitemporal observation ideally suited to UAV deployment. The UAV imagery was acquired by a DJI Phantom 4 RTK drone flown at 30 m altitude during midday hours (12:00–14:00) under clear skies, yielding 0.82 cm spatial resolution. With 460 training and 140 test images facilitating super-resolution enhancement from to pixel resolution, this dataset directly addresses UAV payload limitations by enabling high-fidelity ecological diagnostics from lower-resolution captures. Its design supports monitoring of seasonal vegetation changes and anthropogenic impacts in fragile alpine ecosystems through UAV-optimized super-resolution.
The RSSCN7 dataset comprises 2,800 UAV-compatible images standardized at resolution, organized into seven land cover categories critical for UAV ecological surveys: grasslands, farmlands, forests, river/lake systems, industrial zones, residential areas, and parking facilities. This categorization aligns with UAV monitoring priorities such as agricultural health assessment, forest canopy condition evaluation, and riparian zone mapping. Each category contains 400 images subdivided across four spatial scales, simulating resolution variations encountered during UAV operations at different flight altitudes. Sourced globally, the imagery exhibits seasonal, weather, and phenological diversity that trains models to handle atmospheric turbulence, variable illumination, and cloud cover—common UAV operational challenges in ecological monitoring. The dataset enables robust super-resolution for detecting subtle ecological transitions, such as forest indicator species distribution or vegetation stress responses, under real-world UAV survey conditions.
Complementing this, the UC Merced Land Use Dataset (UCM) provides 2,100 aerial images simulating fixed-wing UAV perspectives, with each resolution image representing one of 21 land-use categories including agricultural fields, forests, and dense residential zones. Captured across diverse U.S. regions, it supports UAV applications in urban ecology and precision conservation planning at human-nature interfaces. The dataset’s fine-grained classifications enable super-resolution models to discern subtle ecological transitions in fragmented landscapes, such as biodiversity corridors in peri-urban areas or vegetation health in agricultural plots—tasks frequently addressed through UAV monitoring. Its urban-wildland interface scenes are particularly valuable for developing UAV-based traffic management and infrastructure monitoring systems in smart cities.
Expanding into complex coastal environments, the WHU-RS19 dataset contributes approximately 950 UAV-compatible images spanning 19 scene categories including beaches, harbors, deserts, and forests. With variable dimensions typically around pixels, it captures complex textures (e.g., forest canopies, coastal sediments) under diverse illumination and atmospheric conditions. This diversity trains super-resolution algorithms to overcome UAV-specific degradation challenges like motion blur during windy coastal flights or atmospheric haze in humid environments—critical for detecting ecological disturbances such as wetland loss or coastal erosion. The dataset’s emphasis on fine structural details supports UAV applications in ecological monitoring of coastal wetlands, where identifying cross-channel signatures of vegetation stress or sediment composition requires high-fidelity imagery.
Collectively, these datasets provide a UAV-centric foundation for advancing super-resolution techniques in ecological monitoring. The resolution enhancement from to demonstrated with RSUAV-QH exemplifies how computational approaches can overcome inherent UAV payload limitations, enabling high-fidelity ecological assessment without requiring expensive sensors or impractical flight altitudes. By focusing exclusively on UAV-compatible data with explicit ecological relevance—from alpine conservation and agricultural health to coastal ecosystems and urban-wildland interfaces—this framework supports UAV deployment for biodiversity monitoring, habitat fragmentation analysis, and precision conservation in challenging environments.
5.2. Experiment Settings
The evaluation methodology is specifically tailored to UAV-acquired remote sensing imagery for ecological monitoring applications, where super-resolution techniques enhance the spatial details critical for analyzing vegetation patterns, habitat structures, and biodiversity indicators. Performance assessment employs a comprehensive suite of six complementary metrics designed to quantify both pixel-level accuracy and perceptual quality, with particular emphasis on UAV remote sensing characteristics. The peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM)[
48] serve as fundamental full-reference metrics that measure reconstruction fidelity against high-resolution ground truth, essential for identifying fine-scale ecological features in UAV imagery. Given a reference high-resolution image
and its reconstructed counterpart
, the mean squared error forms the basis for PSNR calculation:
where
N represents the total number of pixels. The PSNR in decibels is subsequently derived as:
with
L denoting the maximum possible pixel value. The SSIM metric extends beyond pixel-wise comparison by evaluating structural coherence through local statistics, particularly valuable for maintaining texture integrity in UAV vegetation mapping:
where
and
represent local means,
and
denote variances,
is the covariance, and stabilization constants
,
prevent division instability.
Three specialized metrics address the unique requirements of UAV ecological surveys conducted at low altitudes. The root-mean-square error (RMSE) quantifies absolute pixel-wise deviation critical for biomass quantification in precision agriculture:
The spectral angle mapper (SAM)[
49] assesses cross-channel fidelity essential for species discrimination by computing angular differences between corresponding pixel vectors across cross-channel bands:
where
and
denote cross-channel vectors at pixel location
i. Perceptual quality is evaluated through two no-reference metrics adapted for UAV-based monitoring. The Natural Image Quality Evaluator (NIQE)[
50] models image statistics using a multivariate Gaussian distribution fit to natural scene patches, capturing distortions common in drone-acquired imagery:
where
and
represent feature mean and covariance of the reconstructed image, while
and
correspond to parameters derived from pristine natural images. The Learned Perceptual Image Patch Similarity (LPIPS)[
51] metric employs deep features extracted from a pre-trained convolutional network, evaluating visual quality relevant for the ecological interpretation of UAV imagery:
where
denotes activations from layer
l of a pre-trained VGG network,
are spatial dimensions at layer
l, and
represents channel-wise adaptive weights.
Superior reconstruction quality for UAV ecological applications is indicated by higher PSNR and SSIM values that ensure structural fidelity of habitat features, lower RMSE and SAM measurements that preserve radiometric accuracy for quantitative analysis, and reduced NIQE and LPIPS scores that capture perceptual degradation not reflected in traditional pixel-based metrics. All experiments were implemented in PyTorch and executed on an NVIDIA A100 40GB GPU within a high-performance computing environment suitable for processing UAV image datasets. The model processes randomly cropped low-resolution patches during training, with a batch size of 16 and random rotations applied for data augmentation to enhance generalization across diverse UAV flight patterns. Optimization employed the Adam algorithm with coefficients and , commencing with a learning rate of that decayed by a factor of 10 after 80 epochs over a total training duration of 200 epochs. The architecture incorporated 4 residual groups, each containing 6 block modules consistent with MambaIR configurations, featuring a state expansion factor of 16. Convolutional layers in upsampling and downsampling modules utilized kernel sizes of 3, 7, 13, and 17 with respective padding of 1, 3, 6, and 8 to maintain spatial dimensions appropriate for UAV image structures. The proposed model exhibits less parameters (around 2.1M) than the existing state-of-the-art models (around 2.9M) but still outperform the SOTA models in every metric.
5.3. Quantitative Results
The quantitative evaluation demonstrates that SIMSR consistently outperforms state-of-the-art methods across all datasets and metrics, though smaller in number of parameters, establishing new benchmarks in super-resolution performance.
Class-specific analysis on the RSSCN7 dataset demonstrates SIMSR’s critical advantages for applications in various types and patterns in remote sensing imagery. For monitoring geometrically complex infrastructures like industrial zones (cIndustry) and transportation facilities (gParking), SIMSR achieves LPIPS values of 0.3416 and 0.3489 respectively—outperforming alternatives by 1.4%–4.5%—through its gated delta mechanism that preserves structural integrity vital for urban change detection. In ecological monitoring scenarios featuring grasslands (aGrass) and forests (eForest), cross-channel fidelity is maintained with PSNR exceeding 30.27 dB and SAM below 0.1502, supporting accurate vegetation health assessment. The test-time training module extends the effective receptive field to capture irregular patterns in residential areas (fResident) and water bodies (dRiverLake), reducing spatial distortions and yielding 10%–15% RMSE improvements crucial for flood mapping. SIMSR’s heatmaps precisely delineate critical features like shorelines and wave patterns, achieving class-leading SSIM (0.8072) and SAM (0.1508) while suppressing spurious activations that degrade NIQE in homogeneous regions. These capabilities resolve fundamental conflicts between global context modeling and local detail preservation while overcoming spatial adaptation limitations in recurrent architectures, demonstrating consistent improvements particularly in high-frequency domains essential for remote sensing interpretation.
Benchmarks on other datasets also prove superior performances. On the UCM dataset, SIMSR achieves a PSNR of 24.8312 dB and SSIM of 0.8598, surpassing CNN-based SRCNN and VDSR by >3.5 dB and >0.15 SSIM, while exceeding Transformer-based SwinIR and HAT by >2.3 dB and >0.05 SSIM. Notably, it reduces LPIPS (reflecting perceptual fidelity) to 0.2198—significantly lower than MambaIR (0.2531) and MambaIRv2 (0.2568)—validating its superior alignment with human visual perception. Similarly, on WHU-RS19, SIMSR attains a record NIQE of 6.5012 (indicating enhanced naturalness) and LPIPS of 0.3544, demonstrating its robustness against noise and blur artifacts that persistently challenge comparative methods. These gains stem from SIMSR’s integration of a linear attention mechanism with delta rule-based memory updates , which dynamically filters high-frequency noise while adaptively sharpening edges—capabilities inherently limited in CNN architectures due to fixed receptive fields and in Transformers due to quadratic computational constraints.
Table 1.
Quantitative comparison results for RSUAV-QH, UCM and WHU-RS19 dataset.
Table 1.
Quantitative comparison results for RSUAV-QH, UCM and WHU-RS19 dataset.
| Datasets |
Method |
PSNR ↑ |
SSIM ↑ |
NIQE ↓ |
LPIPS ↓ |
RMSE ↓ |
SAM ↓ |
| RSUAV-QH |
SRCNN[27] |
21.1437 |
0.7093 |
9.8234 |
0.3373 |
7.7725 |
0.1506 |
| VDSR[28] |
21.3548 |
0.7346 |
10.3282 |
0.3338 |
7.5716 |
0.1602 |
| SwinIR[17] |
22.6073 |
0.7891 |
8.8427 |
0.2909 |
7.6129 |
0.1528 |
| HAT[31] |
23.8924 |
0.8617 |
8.3129 |
0.2189 |
6.4297 |
0.1583 |
| MambaIR[39] |
24.8382 |
0.8365 |
9.3182 |
0.2469 |
6.6814 |
0.1478 |
| MambaIRv2[40] |
23.6127 |
0.8173 |
9.5198 |
0.2504 |
7.1163 |
0.1476 |
| SIMSR (Ours) |
24.9281 |
0.8665 |
7.9073 |
0.2135 |
6.2426 |
0.1463 |
| UCM |
SRCNN[27] |
21.0571 |
0.7018 |
9.9216 |
0.3467 |
7.8808 |
0.1542 |
| VDSR[28] |
21.2666 |
0.7280 |
10.4365 |
0.3422 |
7.6733 |
0.1644 |
| SwinIR[17] |
22.5205 |
0.7812 |
8.9439 |
0.2983 |
7.7241 |
0.1565 |
| HAT[31] |
24.7515 |
0.8540 |
9.6211 |
0.2254 |
6.5343 |
0.1625 |
| MambaIR[39] |
23.8029 |
0.8291 |
9.4208 |
0.2531 |
6.7818 |
0.1514 |
| MambaIRv2[40] |
23.5250 |
0.8104 |
8.4174 |
0.2568 |
7.2239 |
0.1514 |
| SIMSR (Ours) |
24.8312 |
0.8598 |
8.1124 |
0.2198 |
6.3469 |
0.1501 |
| WHU-RS19 |
SRCNN[27] |
23.2700 |
0.7069 |
8.1523 |
0.3589 |
8.0819 |
0.1516 |
| VDSR[28] |
23.7775 |
0.7122 |
7.1796 |
0.3682 |
8.1232 |
0.1427 |
| SwinIR[17] |
23.4011 |
0.5908 |
7.8451 |
0.4720 |
7.8556 |
0.1473 |
| HAT[31] |
23.6580 |
0.5993 |
8.3204 |
0.4633 |
8.6052 |
0.1422 |
| MambaIR[39] |
23.7002 |
0.7084 |
6.6657 |
0.4052 |
8.0293 |
0.1506 |
| MambaIRv2[40] |
23.9886 |
0.7208 |
7.5345 |
0.3755 |
7.9644 |
0.1501 |
| SIMSR (Ours) |
24.2634 |
0.7296 |
6.5012 |
0.3544 |
7.8764 |
0.1406 |
5.4. Qualitative Results and Feature Analysis
To qualitatively evaluate the super-resolution capabilities of our proposed model, we present visual comparisons with baseline and state-of-the-art methods on representative images and heat maps of Local Attribution Maps (LAMs)[
52] from the RSSCN7, UCM, WHU-RS19 and RSUAV-QH datasets at 4× scale factors. LAM is a method based on Integrated Gradients[
53] designed to analyze and visualize the contribution of individual input pixels to the output of deep SR networks, which introduces a Diffusion Index (DI) to quantitatively measure the extent of pixel involvement in the reconstruction process. With LAM, we can identify how input pixels contribute to the selected region.
In the RSSCN7 dataset, agricultural scenes feature complex geographic and artificial elements where detail and edge processing critically determine model performance. As illustrated in
Figure 11, images depict airports, factories, and intercity viaducts. The proposed SIMSR demonstrates significant advantages in detail reconstruction and edge sharpening, particularly for intricate strip-like features prevalent in agricultural landscapes. In contrast, results from comparative models (SRCNN, VDSR, HAT, MambaIR) exhibit noticeable blurring, failing to accurately capture feature boundaries and consistently underperforming in reconstructing linear structures. Heat map analysis reveals superior capability in SIMSR: while competing models produce diffused heat maps lacking precision, our model displays focused activation patterns indicating comprehensive information extraction and fusion across all spatial details. This enhanced feature discrimination directly contributes to sharper output images with superior structural integrity.
Table 2.
Quantitative comparison results for individual classes of images in the RSSCN7 dataset.
Table 2.
Quantitative comparison results for individual classes of images in the RSSCN7 dataset.
| Classes |
Method |
PSNR ↑ |
SSIM ↑ |
NIQE ↓ |
LPIPS ↓ |
RMSE ↓ |
SAM ↓ |
| aGrass |
SRCNN[27] |
31.6593 |
0.7717 |
7.3087 |
0.4076 |
5.4812 |
0.1617 |
| VDSR[28] |
31.7324 |
0.7729 |
7.1464 |
0.4024 |
5.4694 |
0.1614 |
| SwinIR[17] |
31.7514 |
0.7647 |
7.5602 |
0.3620 |
5.3454 |
0.1579 |
| HAT[31] |
31.7586 |
0.7733 |
7.2579 |
0.4036 |
5.4552 |
0.1565 |
| MambaIR[39] |
32.5437 |
0.7951 |
7.3111 |
0.3375 |
4.9799 |
0.1548 |
| MambaIRv2[40] |
32.8623 |
0.8202 |
7.1536 |
0.3357 |
6.1316 |
0.1587 |
| SIMSR (Ours) |
32.9074 |
0.8257 |
7.0268 |
0.3312 |
4.8296 |
0.1539 |
| bField |
SRCNN[27] |
30.9887 |
0.6972 |
7.7422 |
0.4384 |
5.8524 |
0.1579 |
| VDSR[28] |
31.0113 |
0.6880 |
7.9297 |
0.4175 |
5.8004 |
0.1592 |
| SwinIR[17] |
31.0884 |
0.6996 |
7.5414 |
0.4309 |
5.8324 |
0.1563 |
| HAT[31] |
31.1053 |
0.6998 |
7.6590 |
0.4331 |
5.8254 |
0.1546 |
| MambaIR[39] |
31.5853 |
0.7124 |
7.7017 |
0.3976 |
5.5454 |
0.1542 |
| MambaIRv2[40] |
31.8543 |
0.7381 |
7.6077 |
0.3744 |
7.2795 |
0.1544 |
| SIMSR (Ours) |
31.9009 |
0.7428 |
7.4724 |
0.3713 |
5.4483 |
0.1530 |
| cIndustry |
SRCNN[27] |
23.8071 |
0.6530 |
7.5993 |
0.3959 |
7.7559 |
0.1540 |
| VDSR[28] |
24.2170 |
0.6841 |
6.9130 |
0.4155 |
7.7071 |
0.1544 |
| SwinIR[17] |
24.2811 |
0.6874 |
7.0884 |
0.3713 |
7.4907 |
0.1543 |
| HAT[31] |
24.5127 |
0.6972 |
7.0539 |
0.3946 |
7.6396 |
0.1539 |
| MambaIR[39] |
24.5198 |
0.6976 |
7.3548 |
0.3897 |
7.6512 |
0.1525 |
| MambaIRv2[40] |
25.1669 |
0.7423 |
6.1057 |
0.3465 |
7.5097 |
0.1522 |
| SIMSR (Ours) |
25.2485 |
0.7488 |
5.8562 |
0.3416 |
7.2029 |
0.1501 |
| dRiverLake |
SRCNN[27] |
26.0556 |
0.7788 |
6.8601 |
0.3210 |
6.1159 |
0.1512 |
| VDSR[28] |
28.9093 |
0.7741 |
6.7854 |
0.3498 |
5.7623 |
0.1517 |
| SwinIR[17] |
28.9152 |
0.7847 |
6.8352 |
0.3793 |
5.8470 |
0.1517 |
| HAT[31] |
29.0360 |
0.7872 |
6.8792 |
0.3729 |
5.8266 |
0.1632 |
| MambaIR[39] |
29.0572 |
0.7875 |
6.8479 |
0.3748 |
5.8134 |
0.1675 |
| MambaIRv2[40] |
29.4932 |
0.8008 |
6.8506 |
0.3075 |
5.4813 |
0.1698 |
| SIMSR (Ours) |
29.5927 |
0.8072 |
6.7513 |
0.3035 |
5.2786 |
0.1508 |
| eForest |
SRCNN[27] |
26.3516 |
0.5835 |
9.0943 |
0.5012 |
7.9684 |
0.1637 |
| VDSR[28] |
26.3947 |
0.5854 |
8.9165 |
0.5087 |
7.9465 |
0.1537 |
| SwinIR[17] |
26.4321 |
0.5852 |
8.8184 |
0.4994 |
7.9394 |
0.1514 |
| HAT[31] |
26.4655 |
0.5713 |
8.8667 |
0.4525 |
7.9155 |
0.1509 |
| MambaIR[39] |
26.8391 |
0.5948 |
9.2291 |
0.4448 |
7.7438 |
0.1625 |
| MambaIRv2[40] |
30.2330 |
0.8339 |
6.4067 |
0.3120 |
8.2871 |
0.1586 |
| SIMSR (Ours) |
30.2738 |
0.8414 |
6.3091 |
0.3061 |
7.6905 |
0.1502 |
| fResident |
SRCNN[27] |
22.9982 |
0.6361 |
8.5432 |
0.4148 |
8.2945 |
0.1669 |
| VDSR[28] |
23.2386 |
0.6630 |
8.3945 |
0.4454 |
8.2801 |
0.1571 |
| SwinIR[17] |
23.4050 |
0.6661 |
8.4604 |
0.3976 |
8.0717 |
0.1560 |
| HAT[31] |
23.4675 |
0.6743 |
8.1616 |
0.4290 |
8.2068 |
0.1543 |
| MambaIR[39] |
23.4765 |
0.6749 |
8.4901 |
0.4248 |
8.2172 |
0.1541 |
| MambaIRv2[40] |
27.6244 |
0.6900 |
9.4581 |
0.4196 |
8.1094 |
0.1564 |
| SIMSR (Ours) |
27.6757 |
0.6955 |
8.0123 |
0.4127 |
7.8572 |
0.1508 |
| gParking |
SRCNN[27] |
23.2839 |
0.6139 |
7.5217 |
0.4232 |
7.7822 |
0.1560 |
| VDSR[28] |
23.5637 |
0.6429 |
6.8784 |
0.4400 |
7.7423 |
0.1573 |
| SwinIR[17] |
23.6548 |
0.6469 |
7.0965 |
0.3974 |
7.5371 |
0.1558 |
| HAT[31] |
23.8184 |
0.6568 |
6.7592 |
0.4190 |
7.6764 |
0.1553 |
| MambaIR[39] |
23.8386 |
0.6578 |
6.9988 |
0.4155 |
7.6839 |
0.1552 |
| MambaIRv2[40] |
25.0994 |
0.7680 |
7.3963 |
0.3538 |
7.4748 |
0.1545 |
| SIMSR (Ours) |
25.1659 |
0.7766 |
6.6415 |
0.3489 |
7.3752 |
0.1533 |
The proposed SIMSR further excels on the RSUAV-QH dataset when processing images degraded through complex quality reduction.
Figure 11 exemplifies this using images containing multiple buildings, where the low-resolution input exhibits severe local information loss after aggressive texture reduction. Comparative models generate blurred reconstructions with insufficient detail recovery, fundamentally failing to restore the distinct contours and shapes of subjects such as yaks. SIMSR overcomes these limitations through Test-Time Training integration, which also succeeds in expanding the effective receptive field to better capture global dependencies, just as illustrated in
Figure 1. This enables extraction of structurally coherent features that successfully reconstruct sharp object boundaries (e.g., yak silhouettes and building edges) while significantly improving overall image recognizability. Heat map comparisons confirm SIMSR’s precision in identifying key features within globally coherent contexts, directly translating to perceptually superior output sharpness.
5.5. Efficiency Study
To rigorously evaluate the computational efficiency of the proposed SIMSR architecture in UAV-based remote sensing super-resolution, we conduct comprehensive benchmark analyses against state-of-the-art Transformer baselines (SwinIR [
17], HAT [
31], MambaIR [
39], MambaIRv2 [
40]). These comparisons address critical UAV operational constraints where centimeter-resolution aerial imagery generates gigapixel-scale sequences exceeding computational limits of edge platforms , while demanding real-time processing for time-sensitive applications like precision agriculture and disaster response . Experiments on NVIDIA A100 40GB GPUs demonstrate SIMSR’s efficiency breakthroughs through its
geographically-chunked processing strategy, which optimizes hardware utilization while preserving essential spatial-semantic relationships unique to UAV oblique imagery .
The optimization trajectory reveals transformative gains across implementation paradigms. The Naive PyTorch implementation (plain arithmetic + autograd) incurs excessive recursive computational graphs, causing prohibitive training (12h 23m) and inference (1h 21m 10s) latency for UAV time-series analysis. Element-wise fused kernels (Triton FP32/BF16) partially mitigate this but fail to leverage GPU multithreading and cache locality , yielding suboptimal FLOPs (71.23G) and inference delays (13m 50s). In contrast, our
chunk-wise Triton kernel (BF16) exploits UAV-acquired spatial coherence by processing ecologically contiguous regions (e.g., agricultural plots, urban blocks) via batched GEMM operations,
reducing FLOPs by 32% (60.78G) and
accelerating inference by 10.85× (7m 29s) and training by 8.74×(1h 25m) versus naive implementations (
Table 3).
Comparative analysis against SOTA methods (
Table 4) demonstrates SIMSR’s superiority in UAV-relevant metrics. With lowest FLOPs (60.78G), fastest training (1h 25m), and real-time inference (7m 29s) at minimal parameters (2.12M), SIMSR achieves 73% faster inference than SwinIR—critical for processing UAV orthomosaics exceeding 10,000×10,000 pixels. The efficiency stems from two UAV-specific innovations: (1)
Unified Tensor (UT) transforms that compress oblique imaging geometry into low-rank representations; (2)
semantic-guided chunking that decomposes scenes into ecologically coherent units (e.g., watersheds, crop parcels ) for
parallel computation (
C=
), eliminating sequential bottlenecks in MambaIR variants while maintaining diagonal feature awareness essential for agricultural contours .
Memory optimization is paramount for UAV edge deployment. As profiled in
Table 5, SIMSR achieves 86.7% L2 cache hit rate (2× higher than MambaIRv2) and 78 GB/s bandwidth by aligning chunk access patterns with GPU cache lines and UAV scene layouts. This reduces DRAM accesses by 54% versus SwinIR, preventing out-of-memory crashes when processing continental-scale mosaics on <8GB embedded GPUs. The contiguous processing flow specifically benefits UAV multi-temporal stacks where spectral consistency across acquisitions is maintained through semantic anchoring.
5.6. Ablation Study and Comprehensive Analysis
To rigorously evaluate the contribution of each architectural component in SIMSR, we conduct ablation studies measuring performance through both quantitative metrics and qualitative assessments. These experiments systematically isolate and evaluate the impact of four key innovations: the Omni-Shift mechanism, SISM backbone, Channel Attention module, and 2D scanning strategy. The baseline is with naive self-attention backbone. Each component demonstrates significant and measurable contributions to the overall performance, validating our architectural choices through controlled comparisons against alternative implementations.
Ablation studies validate each component’s contribution to SIMSR’s performance. The Omni-Shift and Channel Attention mechanism achieves a 0.65 dB PSNR improvement and 4% LPIPS reduction over simpler shift strategies by modeling multi-scale spatial relationships through parallel convolutional pathways with varying kernels, effectively preserving structural details in complex geographical features. Our SISM backbone provides a 0.26 dB PSNR gain and 10% SAM reduction compared to conventional architectures, enabling continuous adaptation to input characteristics during inference for handling seasonal vegetation variations and illumination changes. The Semantic Decomposition module delivers a significant 4.16 dB PSNR improvement and 25% NIQE reduction. Finally, the 2D scanning strategy boosts SSIM by 3% and reduces SAM errors by 4.2% versus 1D methods by preserving bidirectional spatial relationships essential for reconstructing linear features and urban structures.
As evidenced in
Table 6, each component contributes cumulatively to the overall performance, with the complete SIMSR configuration achieving optimal results. The full model integrates all innovations synergistically, combining the Omni-Shift’s multi-scale feature extraction, SISM Backbone’s dynamic state adaptation, Channel Attention’s cross-channel modeling, and 2D scanning’s spatial relationship preservation. This comprehensive approach establishes new state-of-the-art performance while maintaining computational efficiency, demonstrating that the architectural innovations collectively address the fundamental challenges in remote sensing super-resolution.
Also, further comprehensive analyses are conducted to validate the design choices in our framework on the super-resolution task. The proposed Omni-Shift mechanism demonstrates clear advantages over simpler token shift approaches, improving PSNR by 0.65 dB and reducing LPIPS by over 4% compared to Quad-Shift and Uni-Shift while enhancing structural preservation. Our TTT backbone outperforms both ResNet and naive attention alternatives across all metrics, achieving 0.26 dB higher PSNR and 10% lower SAM than attention-based implementations, confirming its effectiveness in capturing spatial dependencies. For feature transformation, the channel attention module significantly surpasses standard MLP variants with various activation functions like Rectified Linear Unit (ReLU)[
54] and Gaussian Error Linear Unit (GELU)[
55], delivering 0.29 dB PSNR gain and 25% NIQE reduction while substantially improving perceptual quality. The 2D scanning strategy proves superior to conventional 1D methods, enhancing SSIM by 3% and reducing cross-channel distortion (SAM) by 4.2%, validating its ability to model complex spatial relationships. These controlled experiments collectively demonstrate that each proposed component contributes substantially to the overall performance gains.
Table 7.
Studies on Impacts of Multiple Components
Table 7.
Studies on Impacts of Multiple Components
| Component |
Method |
PSNR ↑ |
SSIM ↑ |
NIQE ↓ |
LPIPS ↓ |
RMSE ↓ |
SAM ↓ |
| Token Shift |
Uni-Shift |
26.1232 |
0.7133 |
6.5121 |
0.3211 |
6.4345 |
0.170245 |
| Quad-Shift |
26.3523 |
0.7299 |
6.3325 |
0.3189 |
6.4023 |
0.169702 |
| Omni-Shift (Ours) |
26.7741 |
0.7493 |
6.0509 |
0.3086 |
6.3994 |
0.165984 |
| Backbone |
ResNet |
26.4526 |
0.8101 |
5.6533 |
0.2576 |
6.1755 |
0.168675 |
| Naive Attention |
26.5205 |
0.8313 |
5.5299 |
0.2398 |
6.1466 |
0.165219 |
| SISM (Ours) |
26.6128 |
0.8469 |
5.2425 |
0.2167 |
6.1121 |
0.159541 |
| MLP Variants |
MLP(ReLU) |
30.5086 |
0.9066 |
5.6065 |
0.1752 |
4.7284 |
0.158447 |
| MLP(GELU) |
30.5653 |
0.9164 |
5.4276 |
0.1554 |
4.7006 |
0.156102 |
| ChannelAtt (Ours) |
30.7739 |
0.9574 |
4.8475 |
0.1119 |
4.6229 |
0.144746 |
| Scan Methods |
1D Scan |
27.4849 |
0.7789 |
6.1649 |
0.2913 |
6.2295 |
0.180285 |
| 2D Scan (Ours) |
27.5970 |
0.8023 |
5.8193 |
0.2635 |
6.1906 |
0.172631 |