4.2. Zhangjiakou (ZJK)
The ZJK dataset is a high-spatial and high-spectral () benchmark dataset established by our team, specifically designed for fine-grained tree species identification in complex urban-fringe and "forest-agriculture" mosaic ecosystems.
4.2.1. Study Area Overview and Environmental Background
The ZJK dataset was collected in Donghuayuan Town, Huailai County, Zhangjiakou City, Hebei Province, with geographical coordinates of N, E and an elevation of 470–480 m. The region lies within the Huailai Basin and has a typical temperate continental monsoon climate, with four distinct seasons and rainfall concentrated in the warm season. The study site covers approximately 400 mu (about 26.7 ha) and has flat micro-topography. Land cover comprises artificial forests, nurseries, and seasonal agricultural crops, giving the site typical transitional landscape characteristics and a complex, diverse ecological background for tree species identification.
Figure 5.
Geographical location and environmental setting of the study area: (left) topographic map of Huailai County, Hebei Province; (right) schematic diagram of the ZJK dataset site in Donghuayuan Town.
4.2.2. Spatial-Spectral Data Acquisition Platform
The imagery was acquired in October 2025 using an ultra-high-resolution multi-source remote sensing platform. A DJI Matrice 350 RTK UAV, equipped with an X20P-LIR integrated multi-source imaging system, served as the flight platform. To minimize shadow interference and ensure a high signal-to-noise ratio (SNR), flight missions were conducted between 11:00 and 14:00 under clear, cloudless weather conditions.
The UAV flight altitude was set at 80 m. The acquired hyperspectral imagery covers a spectral range of 350–1000 nm (consisting of 164 bands) with an image size of pixels and a spatial resolution of approximately 0.027 m. Simultaneously, RGB imagery with 26 million pixels and thermal infrared (TIR) imagery with a resolution of pixels were also obtained.
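As a rough aid for reading band indices, the sketch below maps a band number to an approximate center wavelength. It assumes the 164 bands are evenly spaced across 350–1000 nm, which the text does not state; the sensor's actual band centers may differ, and the function name is hypothetical.

```python
import numpy as np

# Values taken from the text: 164 bands spanning 350-1000 nm.
N_BANDS = 164
LAMBDA_MIN_NM = 350.0
LAMBDA_MAX_NM = 1000.0


def approx_band_centers(n_bands=N_BANDS, lo=LAMBDA_MIN_NM, hi=LAMBDA_MAX_NM):
    """Return approximate band-center wavelengths in nm.

    ASSUMPTION: bands are uniformly spaced, with the first and last
    centers sitting at the two ends of the spectral range.
    """
    return np.linspace(lo, hi, n_bands)


centers = approx_band_centers()
spacing = centers[1] - centers[0]  # (1000 - 350) / 163 nm under this assumption
```

Under this assumption the sampling interval is just under 4 nm; the calibration file delivered with the sensor should be preferred whenever it is available.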
Figure 6.
The data acquisition platform and flight planning: (a) X20P-LIR system mounted on the DJI M300/350 RTK multi-rotor UAV; (b) Predefined flight route map.
4.2.3. Systematic Preprocessing and Ground Truth Construction
Following professional remote sensing standards, the raw hyperspectral data underwent a streamlined preprocessing and validation pipeline:
(1) Radiometric and Geometric Correction: Raw Digital Number (DN) values were converted to surface reflectance using laboratory calibration parameters and the empirical line method with reference panels. Geometric distortions were corrected by integrating high-precision GNSS/IMU data with a digital elevation model (DEM) to achieve sub-decimeter spatial registration, ensuring the geometric fidelity of the extracted spectral curves.
(2) Collaborative Annotation and Sample Generation: To manage large-scale imagery, a spatial partitioning strategy was employed where multiple annotators performed pixel-level delineation using Labelme. The scattered JSON annotation files were then integrated and processed via a self-developed data formatting tool. This tool performed coordinate transformation, topological stitching, and automated cropping, effectively transforming raw polygons into a seamless, standardized, and AI-ready dataset.
(3) Field-to-Image Verification: To ensure taxonomic integrity, systematic plots were established for concurrent botanical surveys (identifying species, tree height, and DBH). These ground records were cross-validated with the annotated polygons to exclude ambiguous samples caused by shadows or spectral mixing, guaranteeing the botanical reliability of the final ground truth labels for fine-grained classification.
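The empirical line method mentioned in step (1) can be sketched as a per-band linear fit between the DN values measured over the reference panels and their known reflectance. This is a minimal illustration on synthetic numbers, not the pipeline actually used for ZJK; the function names, panel reflectances, and array shapes are assumptions.

```python
import numpy as np


def fit_empirical_line(panel_dn, panel_reflectance):
    """Fit per-band gain and offset so that reflectance ~= gain * DN + offset.

    panel_dn:          (n_panels, n_bands) mean DN over each reference panel
    panel_reflectance: (n_panels, n_bands) known reflectance of each panel
    """
    panel_dn = np.asarray(panel_dn, dtype=float)
    panel_reflectance = np.asarray(panel_reflectance, dtype=float)
    dn_mean = panel_dn.mean(axis=0)
    rf_mean = panel_reflectance.mean(axis=0)
    # Ordinary least-squares slope and intercept, computed band by band.
    cov = ((panel_dn - dn_mean) * (panel_reflectance - rf_mean)).sum(axis=0)
    var = ((panel_dn - dn_mean) ** 2).sum(axis=0)
    gain = cov / var
    offset = rf_mean - gain * dn_mean
    return gain, offset


def dn_to_reflectance(cube_dn, gain, offset):
    """Apply the fitted line to an (H, W, n_bands) DN cube."""
    return np.asarray(cube_dn, dtype=float) * gain + offset


# Synthetic example: a dark and a bright panel, three bands.
true_gain = np.array([2e-4, 1.5e-4, 1e-4])
true_offset = np.array([-0.01, 0.0, 0.02])
panel_refl = np.array([[0.05, 0.05, 0.05], [0.60, 0.60, 0.60]])
panel_dn = (panel_refl - true_offset) / true_gain
gain, offset = fit_empirical_line(panel_dn, panel_refl)
cube = np.full((2, 2, 3), 1000.0)
reflectance = dn_to_reflectance(cube, gain, offset)
```

With only two panels the fit passes exactly through both points; additional panels would turn it into a genuine least-squares estimate and give some robustness to panel noise.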
4.2.4. Representative Sub-Scenes: HL_1 and HL_2
The complete ZJK dataset currently covers 20 land-cover categories, with various tree species serving as the core components, encompassing a total of 4,857 field-surveyed annotation samples. To rigorously verify the performance of MambaHSINet in fine-grained tree species identification, two representative sub-scenes, HuaiL_1 (HL_1) and HuaiL_2 (HL_2), were strategically selected from the full dataset as benchmark testing scenarios:
1. HL_1 dataset: This scene represents a complex urban-fringe "forest-building" composite landscape, containing 7 land-cover categories with a total of 450,450 labeled pixels. It is characterized by the interlocking distribution of typical tree species, such as Malus spectabilis and Pinus tabuliformis, with artificial structures. Detailed information regarding the number of categories and pixel statistics is provided in Table 2.
Figure 7.
HL_1 dataset. (left) RGB image. (right) Ground truth map and color legends.
2. HL_2 dataset: This scene captures a typical "artificial forest-farmland" transition zone, consisting of 6 major categories with 328,600 labeled pixels. The core difficulty of HL_2 lies in the extreme spectral similarity among taxonomically related coniferous species, particularly Pinus tabuliformis, Picea asperata, and Platycladus orientalis. Detailed information regarding the number of categories and pixel statistics is provided in Table 3.
Figure 8.
HL_2 dataset. (left) RGB image. (right) Ground truth map and color legends.
4.2.5. Spectral Characteristic Analysis
The mean spectral reflectance curves for the HL_1 and HL_2 datasets are illustrated in Figure 9 and Figure 10, respectively.
In the HL_1 scene, the land-cover categories exhibit distinctive spectral signatures. The prominent "red edge" effect and near-infrared plateau features provide a solid physical foundation for the fine-grained identification of different tree species, such as Malus spectabilis and Pinus tabuliformis.
In contrast, the HL_2 scene presents a more significant classification challenge. As shown in Figure 10, the spectral profiles of Crabapple (C1), Populus tomentosa (C2), and Leaf litter (C4) are tightly bundled and overlap across the entire range of 164 bands, particularly within the 100–140 band interval. This extreme inter-class spectral similarity, often referred to as the "same spectrum, different objects" phenomenon, constitutes the core difficulty of this dataset.
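One way to put a number on how closely class mean spectra overlap is the spectral angle between them. The sketch below computes per-class mean spectra from a labeled cube and the angle between two of them; it runs on a tiny synthetic cube, and the function names and the convention that label 0 means "unlabeled" are assumptions rather than details from the dataset.

```python
import numpy as np


def class_mean_spectra(cube, labels, ignore_label=0):
    """Return {class_id: mean spectrum} for an (H, W, B) cube.

    labels is an (H, W) integer map; ignore_label marks unlabeled
    pixels (ASSUMPTION: 0 is used for that purpose).
    """
    cube = np.asarray(cube, dtype=float)
    labels = np.asarray(labels)
    means = {}
    for c in np.unique(labels):
        if c == ignore_label:
            continue
        means[int(c)] = cube[labels == c].mean(axis=0)
    return means


def spectral_angle(a, b):
    """Angle in radians between two spectra (0 means identical shape)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))


# Tiny synthetic cube: 2 x 2 pixels, 3 bands.
cube = np.array([[[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]],
                 [[4.0, 4.0, 4.0], [9.0, 0.0, 0.0]]])
labels = np.array([[1, 1],
                   [2, 0]])
means = class_mean_spectra(cube, labels)
angle_12 = spectral_angle(means[1], means[2])
```

A small angle between two class means indicates spectra that differ mainly in brightness rather than shape, which is the situation in which a purely spectral classifier has little to work with.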
4.4. Comparison with State-of-the-Art Methods
To evaluate MambaHSINet, we compared it with traditional (SVM), CNN-based (Tri-CNN, HybridSN), Transformer-based (SSFTT), and SSM-based (MambaHSI, 3DSS-Mamba, SS-Mamba) methods. Quantitative results for the HL_1, HL_2, and Pavia datasets are summarized in Table 5, Table 6, and Table 7, respectively.
Table 8 distinguishes MambaHSINet from existing Mamba-based state-of-the-art methods. While methods such as SS-Mamba and MambaHSI focus on local patch enhancement via SSMs, MambaHSINet shifts to full-image bidirectional modeling. By explicitly decoupling the spectral and spatial branches, the architecture captures wide-range zonal dependencies and avoids the sliding-window redundancy inherent in patch-based SSMs, which benefits both efficiency and global context awareness.
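The sliding-window redundancy can be made concrete with a back-of-the-envelope count of how many pixel reads one inference pass requires. The image size and patch size below are hypothetical and chosen only for illustration; they are not taken from the experiments.

```python
def pixel_visits(height, width, patch_size=None):
    """Count pixel reads for one inference pass over an image.

    patch_size=None : the whole image is processed once.
    patch_size=P    : one P x P patch is extracted per pixel
                      (stride 1, borders padded), so neighbouring
                      patches overlap heavily.
    """
    if patch_size is None:
        return height * width
    return height * width * patch_size * patch_size


# Hypothetical sizes, for illustration only.
full = pixel_visits(512, 512)
patched = pixel_visits(512, 512, patch_size=9)
redundancy = patched / full  # equals patch_size ** 2
```

Under these assumptions, each pixel is read patch_size squared times in the patch-based setting, whereas full-image processing reads it once.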
Figure 11.
Overview of annotated images and classification performance of SVM, Tri-CNN, HybridSN, SSFTT, MambaHSI, 3DSS-Mamba, SS-Mamba, and the proposed MambaHSINet on the HuaiL_1 dataset.
Figure 12.
Overview of annotated images and classification performance of SVM, Tri-CNN, HybridSN, SSFTT, MambaHSI, 3DSS-Mamba, SS-Mamba, and the proposed MambaHSINet on the HuaiL_2 dataset.
Figure 13.
Overview of annotated images and classification performance of SVM, Tri-CNN, HybridSN, SSFTT, MambaHSI, 3DSS-Mamba, SS-Mamba, and the proposed MambaHSINet on the Pavia dataset.
Table 5 presents the classification results on the HL_1 dataset. The proposed MambaHSINet achieves the best performance, with OA, AA, and Kappa reaching 99.50%, 99.62%, and 99.39%, respectively. Compared with the strong 3D-CNN baseline HybridSN, our method improves OA and AA by 0.40 and 0.99 percentage points, respectively. This suggests that, for complex forest scenes with extremely high spectral redundancy, the dual-branch structure decouples spectral and spatial features better than purely convolutional networks. Furthermore, MambaHSINet outperforms SS-Mamba, indicating a stronger capability for modeling long-range dependencies in fine-grained tree species classification.
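For reference, OA, AA, and Kappa follow from a confusion matrix as sketched below. The matrix is a made-up two-class example, not data from Table 5, and the function name is hypothetical.

```python
import numpy as np


def oa_aa_kappa(conf):
    """Compute overall accuracy, average accuracy, and Cohen's kappa.

    conf[i, j] = number of pixels of true class i predicted as class j.
    """
    conf = np.asarray(conf, dtype=float)
    total = conf.sum()
    oa = np.trace(conf) / total                  # fraction of correct pixels
    per_class = np.diag(conf) / conf.sum(axis=1)  # recall of each class
    aa = per_class.mean()                        # unweighted mean of recalls
    # Agreement expected by chance from the row and column marginals.
    pe = (conf.sum(axis=0) * conf.sum(axis=1)).sum() / total ** 2
    kappa = (oa - pe) / (1.0 - pe)
    return oa, aa, kappa


# Made-up confusion matrix for two classes.
example = [[90, 10],
           [5, 95]]
oa, aa, kappa = oa_aa_kappa(example)
```

OA weights every pixel equally, so large classes dominate it; AA weights every class equally, which is why the two can diverge on datasets with unbalanced class sizes.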
Figure 14.
Detailed visualization of the long-straight boundary issue in the HL_1 dataset.
In the HL_1 dataset, Category 5 (C5) and Category 6 (C6) both represent classes with long-straight boundaries that are cross-arranged and interfere with each other. Comparative experimental results indicate that the original MambaHSI method achieves accuracies of 93.12% for C5 and 95.93% for C6, both of which are lower than other methods such as SVM (97.89% and 98.12%) and HybridSN (99.76% and 99.31%).
This phenomenon reveals a significant long-straight boundary issue: when C5 and C6 are cross-arranged with straight edges, heterogeneous pixel sequence patterns frequently appear within the same scanning line. This interferes with the state update mechanism of Mamba, causing the feature representations of the classes on either side of the boundary to bleed into each other.
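The leakage across a boundary can be illustrated with a toy fixed-decay linear recurrence run along one scan line. This is a deliberately simplified stand-in: Mamba's selective scan uses input-dependent, multi-dimensional state updates, and the decay value, line length, and feature values here are assumptions chosen for illustration.

```python
import numpy as np


def linear_scan(x, decay):
    """Causal recurrence h[t] = decay * h[t-1] + (1 - decay) * x[t].

    The state is initialised to x[0], so a constant line stays constant.
    """
    h = np.empty(len(x), dtype=float)
    state = float(x[0])
    for t, value in enumerate(x):
        state = decay * state + (1.0 - decay) * float(value)
        h[t] = state
    return h


# One scan line crossing a straight boundary:
# 8 pixels of class A (feature 0.0) followed by 8 pixels of class B (feature 1.0).
line = np.array([0.0] * 8 + [1.0] * 8)
forward = linear_scan(line, decay=0.8)               # scan from A into B
backward = linear_scan(line[::-1], decay=0.8)[::-1]  # scan from B into A
```

In the forward scan the first class-B pixel ends up with a state of 0.2, still dominated by class A; in the backward scan the last class-A pixel carries 0.8 of class B. Each scan direction therefore contaminates one side of the boundary, and the contamination repeats on every scan line that crosses a long straight edge.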
MambaHSINet improved the accuracy to 99.51% for C5 and 98.94% for C6, suggesting that the optimizations to the dual-branch architecture, specifically deeper residual connections and enhanced Squeeze-and-Excitation (SE) attention mechanisms, effectively alleviate the boundary confusion problem. Consequently, the overall accuracy (OA) reached 99.48% and the average accuracy (AA) reached 99.53%, outperforming all comparative methods.
Table 6 reports the classification results on the HL_2 dataset. The proposed MambaHSINet achieves the highest OA, 98.39%, ahead of other advanced models such as the Transformer-based SSFTT (95.04%) and the Mamba-based SS-Mamba (95.82%), a gain of 3.35 and 2.57 percentage points, respectively. While HybridSN shows a slight advantage in class C4 (98.36%), our method achieves the highest accuracy in the other five categories, exceeding 99.7% in classes C5 and C6. This consistency across categories suggests that the dual-branch architecture adapts well to datasets with varying spatial scales and complex category distributions, capturing both local textures and long-range dependencies.
Figure 15.
Detailed visualization of the long-straight boundary issue in the HL_2 dataset.
In the HL_2 dataset, Category 4 (C4) is strongly affected by the adjacent long-straight boundary class C6, leading to feature confusion within the spatial branch. The classification accuracy of the original MambaHSI on C4 is only 89.26%, whereas all other comparative methods achieve over 96%, a gap of more than 6.7 percentage points.
Notably, the low accuracy of C4 is not an isolated case; the overall accuracy (OA) and average accuracy (AA) of the original MambaHSI are only 91.17% and 91.76%, respectively, both of which are markedly lower than those of other methods. This indicates that the interference from long-straight boundaries in the HL_2 dataset has a global impact. Due to the presence of long-straight boundary features across multiple categories, the feature confusion at the boundaries within the spatial branch generates a cross-class propagation effect, resulting in a systematic degradation of the model’s overall performance.
MambaHSINet improved the accuracy of C4 to 97.44%, the OA to 98.39%, and the AA to 98.58%, further validating the effectiveness of the architectural optimization in mitigating long-straight boundary issues.
Table 7 presents the results on the Pavia dataset. MambaHSINet achieves the highest overall performance with an OA of 99.05% and AA of 97.16%, outperforming all comparative methods. While our model maintains a robust balance across most categories, specific class-wise variations reveal the architectural sensitivities of different models.
Figure 16.
Detailed visualization of the long-straight boundary issue in the Pavia dataset.
In the Pavia dataset, Category 3 (C3) occupies a relatively small area, resulting in significant performance variations across different algorithms, primarily constrained by statistical fluctuations due to limited training samples. In contrast, Category 6 (C6) consistently exhibits long-straight boundary characteristics and is subjected to strong interference from surrounding objects.
The classification results for C6 reveal a striking disparity among the algorithms: 3DSS-Mamba achieved only 0.53% (rendering it nearly ineffective), while the original MambaHSI reached 97.10% and Tri-CNN attained 99.30%. This substantial gap demonstrates that classes with long-straight boundaries are critical scenarios that lead to performance divergence among different model architectures. Specifically, the combination of 3D convolution and Mamba in 3DSS-Mamba caused severe feature confusion in this context.
MambaHSINet achieved 96.10% on C6, with an overall accuracy (OA) of 98.65% and an average accuracy (AA) of 97.16%. Its leading overall performance among the compared methods indicates that the proposed architectural improvements provide a degree of resistance to long-straight boundary interference.