Preprint article. This version is not peer-reviewed.

Sparse Self-Prompt Guided Stereo Matching for Real-World Generalization

Submitted: 08 April 2026. Posted: 09 April 2026.

Abstract
Stereo matching has witnessed rapid advances on curated benchmarks, yet deploying models in unconstrained real-world environments remains a fundamental challenge. This paper presents a sparse self-prompt guided network (SSPGNet) for stereo matching with strong generalization across diverse environments. Our core innovation lies in a sparse self-prompt guidance mechanism: 1) a sparse disparity map, used as a prompt, is self-estimated from visual foundation model features via cost aggregation; and 2) the sparse disparity is progressively refined into dense disparity maps through cross-attention-based stereo feature interaction, enabling sparse-to-dense disparity prediction. Additionally, we collected a diverse set of indoor and outdoor stereo pairs using a ZED 2 camera to assess the real-world performance of our model. Extensive experiments demonstrate that the proposed sparse-to-dense prompt mechanism not only preserves the semantic awareness of visual foundation models but also enhances stereo correspondence reasoning, achieving strong performance on public benchmarks and our in-the-wild dataset. These results highlight the potential of SSPGNet for direct deployment in real-world stereo perception systems. The code and data will be made publicly available upon publication.

1. Introduction

Stereo matching has been studied for nearly half a century and remains a fundamental challenge in computer vision and robotics. Deep learning and large-scale datasets have driven significant progress, resulting in the proliferation of deep stereo matching models [1,2]. Although these models achieve state-of-the-art performance [3,4] on many public benchmarks—even without fine-tuning [5]—they are still rarely adopted in real-world applications. In other words, current methods excel at predicting disparity maps on specific benchmark datasets but still struggle in open-world scenarios. In contrast, foundation models have emerged in related areas and have demonstrated strong performance on in-the-wild images, such as DepthAnything [6] and Segment Anything [7]. Meanwhile, recent engineering-oriented studies in Sensors further highlight the practical value of stereo vision and disparity estimation in real-world ranging and deployment scenarios [8]. This raises a critical question: how can stereo matching achieve a level of real-world generalization comparable to that of DepthAnything [6]?
DepthAnything [6] suggests that scaling is a key factor in achieving zero-shot generalization, and its success benefits substantially from vision foundation models (VFMs). The performance improvement comes primarily from robust visual features extracted by VFMs such as DINO v2 [9], SAM [7], and DepthAnything [6]. Intuitively, stereo matching should benefit from such robust visual features if they can effectively capture pixel-level spatial information or provide a unified representation across two views. If this assumption holds, we could compare left-right features by constructing a cost volume and then regress the disparity map. However, in practice, we do not obtain the expected results because VFM tokens are spatially coarse and cannot directly preserve sufficient coordinate information for stereo correspondence.
This paper introduces a Sparse Self-Prompt Guided Network (SSPGNet) to address the challenges of stereo matching in dynamic, open-world environments. Our key innovation lies in incorporating geometric priors into the stereo matching network through a sparse self-prompt, thereby enhancing its robustness. Specifically, our approach begins by extracting robust features from VFMs and exploring the relationship between tokens and pixel coordinates. We then estimate a sparse disparity map from the left-right features. Using this sparse disparity as an initial prompt, we guide the model to focus its learning capacity on the most informative regions of the stereo pair. The second key component is cross-attention-based stereo feature interaction. We employ cross-shift window attention between stereo features to generate an affinity matrix, which progressively refines the sparse disparity into a dense map through spatial propagation. This design provides a more nuanced understanding of disparity relationships, enabling the model to handle complex, occluded, and real-world environments more effectively.
To validate the robustness and stability of SSPGNet, we evaluate our model on more than eight public datasets, including CreStereo [10], SceneFlow [11], KITTI 2012&2015 [12,13], ETH3D [14], Middlebury [15], and Flickr1024 [16]. Moreover, we collected a dataset of stereo pairs from diverse indoor and outdoor environments using a ZED 2, as shown in Figure 1. Through extensive experiments, we demonstrate that SSPGNet not only outperforms state-of-the-art methods on standard benchmarks but also achieves strong generalization on our in-the-wild dataset, highlighting its practical value. In summary, our work makes the following contributions:
  • We design a sparse self-prompt mechanism based on vision foundation models to achieve strong generalization for in-the-wild stereo matching.
  • We investigate whether tokens from vision foundation models can represent the coordinate information of all pixels within the original patch for stereo matching.
  • We collect a set of in-the-wild stereo pairs and evaluate our model on more than eight public datasets, highlighting its potential for direct deployment in real-world stereo perception systems.

3. Methodology

3.1. Overview

Given a left-right stereo pair $I_{L,R} \in \mathbb{R}^{3 \times H \times W}$, the goal of stereo matching is to estimate the correspondence information, namely the disparity map, between the stereo images, where $H \times W$ denotes the image size. In this paper, we aim to estimate disparity maps for stereo pairs captured in the wild. As shown in Figure 2, our model contains five core parts: a feature extraction module, a feature transform module, a cost volume construction module, a cost aggregation module, and a sparse prompt module.
First, we use a vision foundation model to extract token features $T_{L,R} \in \mathbb{R}^{C \times H/p \times W/p}$ with patch size $p \times p$ from stereo pairs, while keeping the parameters of the vision foundation model fixed throughout training. Next, we employ the feature transform module to reassemble features from different stages of the foundation model into large-scale features $F_{L,R} \in \mathbb{R}^{C \times 4H/p \times 4W/p}$. Then, we construct a cost volume from the transformed features and feed it into a cost aggregation module based on a 3D hourglass network, generating high-confidence sparse or blurred disparity maps $\hat{d}_{s,b} \in \mathbb{R}^{H \times W}$. Finally, through cross-attention-based stereo feature interaction in the sparse prompt module, we progressively refine sparse disparity into dense disparity maps $\hat{d}_r \in \mathbb{R}^{H \times W}$, achieving sparse-to-dense disparity prediction. We introduce each part in the following sections.
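To make the data flow concrete, the sketch below wires the five parts together in PyTorch. This is a minimal skeleton under stated assumptions, not the released implementation: every component is injected as a callable, and the individual components are sketched in the subsections that follow.

```python
import torch.nn as nn

class SSPGNet(nn.Module):
    """Skeleton of the five-part pipeline; this class only fixes the data
    flow described above, with each component supplied from outside."""

    def __init__(self, vfm, feat_transform, build_cost, cost_agg,
                 sparse_prompt, prompt_module):
        super().__init__()
        self.vfm = vfm                        # frozen vision foundation model
        self.feat_transform = feat_transform  # tokens -> large-scale features
        self.build_cost = build_cost          # cost volume construction
        self.cost_agg = cost_agg              # 3D hourglass aggregation
        self.sparse_prompt = sparse_prompt    # confidence filtering
        self.prompt_module = prompt_module    # sparse-to-dense propagation

    def forward(self, left, right):
        t_l, t_r = self.vfm(left), self.vfm(right)    # multi-layer tokens
        f_l = self.feat_transform(t_l, left)
        f_r = self.feat_transform(t_r, right)
        cost = self.build_cost(f_l, f_r)
        d_init, prob = self.cost_agg(cost)            # disparity + distribution
        d_sparse = self.sparse_prompt(prob, d_init)   # sparse prompt
        return self.prompt_module(f_l, f_r, d_sparse) # dense disparity
```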

3.2. Model Architecture

Feature Extraction Module. Vision foundation models have demonstrated strong performance on multiple tasks, especially on depth prediction. Therefore, we use the empirical configurations of vision foundation models (DINO v2 [9] or DepthAnything v2 [38]) as our feature extraction module. DepthAnything v2 uses DINO v2 based on ViT [41] as its encoder, so DepthAnything v2 and DINO v2 share the same backbone structure. Following DepthAnything v2, we extract tokens from stereo pairs $I_{L,R}$ at different layers from shallow to deep: $T^{(i)}_{L,R} \in \mathbb{R}^{C \times H/p \times W/p}$, $i \in \{4, 11, 17, 32\}$, where the patch size $p$ is set to 14. Meanwhile, to avoid catastrophic forgetting, we freeze the parameters of the vision foundation models and train only the remaining modules. Then, using the feature transform module, we convert the small-sized tokens into larger-scale features.
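As a hedged illustration, extracting multi-layer tokens from a frozen backbone might look as follows. `get_intermediate_layers` is the public DINO v2 interface; the specific hub model name is an assumption, and the layer indices are those listed above.

```python
# Sketch: multi-layer token extraction from a frozen DINOv2-style backbone.
import torch

backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vitg14")
backbone.eval()
for p in backbone.parameters():
    p.requires_grad = False          # frozen to avoid catastrophic forgetting

@torch.no_grad()
def extract_tokens(images):
    # images: (B, 3, H, W), with H and W multiples of the 14-pixel patch.
    # Returns one (B, C, H/14, W/14) feature map per selected block.
    return backbone.get_intermediate_layers(
        images, n=[4, 11, 17, 32], reshape=True
    )
```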
Feature Transform Module. Compared with conventional stereo features, the token resolution is too small, at only $\frac{H}{14} \times \frac{W}{14}$. Therefore, we up-sample and fuse tokens from different layers to obtain features $F_{L,R} \in \mathbb{R}^{C \times 4H/p \times 4W/p}$. The process can be written as:
$$F^{(i)}_{L,R} = \mathrm{FeatUp}\big(\mathrm{Conv}(T^{(i)}_{L,R}),\, I_{L,R}\big),$$
where $\mathrm{Conv}(\cdot)$ denotes a convolution with kernel size $1 \times 1$ used to adjust the channel number, and $\mathrm{FeatUp}(\cdot)$ denotes the feature up-sampling function guided by the left-right images $I_{L,R}$. We then fuse these features to obtain the transformed features $F_{L,R}$ as follows:
$$F_{L,R} = \mathrm{Conv}\big(\big[F^{(i)}_{L,R}, \ldots\big]\big), \quad i \in \{4, 11, 17, 32\},$$
where $\mathrm{Conv}(\cdot)$ denotes a convolution with kernel size $1 \times 1$ used to fuse the features, and $[\cdot, \cdot]$ denotes concatenation along the channel dimension. A natural question arises here: can a token represent the coordinate information of all pixels within the original patch, and if so, why is token up-sampling still necessary? Unlike other dense prediction tasks such as semantic segmentation or depth prediction, stereo matching is fundamentally a correspondence problem; pixel-level coordinate information is therefore crucial. To investigate this issue, we design a verification scheme. As shown in Figure 3, an alternative is to decode a smaller cost volume directly from the tokens $T_{L,R}$ using a series of 3D deconvolutions. In the experiments, we therefore compare constructing the cost volume from transformed features with constructing it directly from tokens.
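A minimal sketch of this module, under stated assumptions, is given below. The paper uses FeatUp for image-guided up-sampling; plain bilinear interpolation stands in here so the sketch stays self-contained, and the `image` argument marks where FeatUp guidance would enter.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureTransform(nn.Module):
    """Per-layer 1x1 channel reduction, up-sampling to 4H/p x 4W/p,
    then channel concatenation and a 1x1 fusion convolution."""

    def __init__(self, in_ch, out_ch, num_layers=4, scale=4):
        super().__init__()
        self.scale = scale
        self.reduce = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, kernel_size=1) for _ in range(num_layers)]
        )
        self.fuse = nn.Conv2d(out_ch * num_layers, out_ch, kernel_size=1)

    def forward(self, tokens, image=None):
        # `image` is unused here; FeatUp would take it as guidance.
        feats = []
        for conv, t in zip(self.reduce, tokens):
            f = conv(t)                              # 1x1 conv: adjust channels
            f = F.interpolate(f, scale_factor=self.scale,
                              mode="bilinear", align_corners=False)
            feats.append(f)                          # (B, C, 4H/p, 4W/p)
        return self.fuse(torch.cat(feats, dim=1))    # concat + 1x1 fuse
```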
Cost Volume Construction. Given the transformed features $F_{L,R} \in \mathbb{R}^{C \times 4H/p \times 4W/p}$, we construct the cost volume $\mathbf{C}(d, h, w) \in \mathbb{R}^{2C \times 4D/p \times 4H/p \times 4W/p}$ by concatenation as follows:
$$\mathbf{C}(d, h, w) = \big[F_L(h, w),\, F_R(h, w - d)\big],$$
where $[\cdot, \cdot]$ denotes concatenation along the channel dimension, $D$ denotes the maximum disparity range, and $d \in \{1, 2, \ldots, 4D/p\}$ is the disparity index.
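The following sketch implements this concatenation volume in the usual way: for each disparity index, the right feature map is shifted and stacked with the left one (the channel count doubles under concatenation).

```python
import torch

def build_cost_volume(f_left, f_right, max_disp):
    """f_left, f_right: (B, C, H, W) -> cost: (B, 2C, max_disp, H, W)."""
    B, C, H, W = f_left.shape
    cost = f_left.new_zeros(B, 2 * C, max_disp, H, W)
    for d in range(max_disp):
        if d == 0:
            cost[:, :C, d] = f_left
            cost[:, C:, d] = f_right
        else:
            # Right feature shifted by d; pixels without a match stay zero.
            cost[:, :C, d, :, d:] = f_left[:, :, :, d:]
            cost[:, C:, d, :, d:] = f_right[:, :, :, :-d]
    return cost
```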
Cost Aggregation Module. We use the cost aggregation module to obtain the initial disparity maps and probability distributions, thereby producing a high-quality sparse prompt. Following previous works, we adopt a 3D hourglass network [42] as the cost aggregation module. The cost aggregation process can be formulated as:
$$\hat{\mathbf{C}}_i = \mathrm{Hourglass}(\mathbf{C}_i), \quad i \in \{1, 2, 3\},$$
$$\hat{P}_i = \mathrm{Softmax}(\hat{\mathbf{C}}_i), \quad i \in \{1, 2, 3\},$$
$$\hat{d}_i = \sum_{d=1}^{4D/p} d \cdot \hat{P}_i(d), \quad i \in \{1, 2, 3\},$$
where $\hat{\mathbf{C}}_i \in \mathbb{R}^{D \times H \times W}$ denotes the $i$-th cost volume after the 3D hourglass, $\hat{P}_i$ denotes the probability distribution of the $i$-th cost volume, and $\hat{d}_i$ represents the $i$-th disparity map.
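For clarity, the softmax and expectation steps above amount to the following soft-argmax regression; this is the standard formulation rather than the authors' exact code.

```python
import torch
import torch.nn.functional as F

def regress_disparity(cost):
    """cost: (B, D, H, W) aggregated volume -> (disparity, distribution)."""
    prob = F.softmax(cost, dim=1)             # per-pixel distribution over d
    disp_values = torch.arange(1, cost.size(1) + 1,
                               device=cost.device, dtype=cost.dtype)
    # Expectation over disparity indices: the soft-argmax.
    disp = (prob * disp_values.view(1, -1, 1, 1)).sum(dim=1)
    return disp, prob
```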
Next, we use the probability distribution of the final stage to generate the sparse prompt. In real-world scenes, the probability distribution is often multi-modal [43], and previous work has proposed an effective distribution modeling method for this case. The multi-modal probability distribution $P_{gt} \in \mathbb{R}^{D \times H \times W}$ can be modeled as a mixture of Laplacians based on the ground-truth disparity $d_{gt}$, as follows:
$$P_{gt}(d_{gt}) = \sum_{k=1}^{K} w_k \cdot \mathrm{Laplacian}_{\mu_k, b_k}(d_{gt}),$$
where $K$ denotes the number of disjoint subsets, and $w_k$, $\mu_k$, and $b_k$ are the weight, mean, and scale parameters of the $k$-th Laplacian distribution, respectively. We use the default setting in [43] to supervise the predicted probability distributions. Then, we apply a confidence threshold $t \in \{0.08, 0.10, 0.12, 0.15\}$ to the probability distribution $\hat{P}_3$ of the final 3D hourglass output to obtain sparse disparity maps $\hat{d}_s$ as prompts. The filtering process is defined as:
$$\hat{d}_s = f(\hat{P}_3, t) \cdot \hat{d}_3,$$
where $f(\cdot)$ denotes the indicator function that keeps a pixel only when its peak matching probability exceeds the threshold $t$.
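A minimal sketch of this filter, assuming $f$ compares the per-pixel peak probability against $t$:

```python
import torch

def sparse_prompt(prob, disp, t=0.15):
    """prob: (B, D, H, W) final-stage distribution; disp: (B, H, W)."""
    confidence = prob.max(dim=1).values        # peak probability per pixel
    mask = (confidence > t).to(disp.dtype)     # indicator function f(P, t)
    return disp * mask                         # low-confidence pixels zeroed
```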
Sparse Prompt Module. Recent works have emphasized that vision foundation models can extract structural information from the input. Meanwhile, adjacent pixels usually have similar disparity values. Therefore, we use such structural information to achieve sparse-to-dense propagation. First, we extract a latent feature space $z$ from the left transformed feature $F_L(i, j)$ and the warped right transformed feature $F_R(i - \hat{d}, j)$ through a Neural Markov Random Field (NMRF) based on cross-shift window attention, where $i, j$ denote pixel coordinates. Then, we convert the latent feature space $z$ into an affinity matrix $A \in \mathbb{R}^{8 \times H \times W}$ using a one-layer MLP. Finally, we embed the sparse disparity map $\hat{d}_s$ into a hidden representation $\mathbf{H} \in \mathbb{R}^{8 \times H \times W}$ and update this hidden representation using the affinity matrix $A$, thereby obtaining the sparse-to-dense disparity map $\hat{d}_r$ through a spatial propagation network. The process can be written as:
$$z = \mathrm{NMRF}\big(F_L(i, j),\, F_R(i - \hat{d}, j)\big),$$
$$A = \mathrm{MLP}(z),$$
$$\mathbf{H} = \mathrm{Embed}(\hat{d}_s),$$
$$\hat{d}_r = \mathrm{SPN}(\mathbf{H}, A).$$
Therefore, we obtain the dense disparity map after the sparse prompt module.
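The propagation step can be pictured with the simplified sketch below. It assumes a convolutional spatial-propagation-style update over the 8-connected neighbourhood; the NMRF and embedding stages are omitted, and border handling is simplified.

```python
import torch

def propagate(hidden, affinity, steps=8):
    """Iteratively refresh each pixel's hidden state from its 8 neighbours,
    weighted by the learned affinities.

    hidden:   (B, C, H, W) embedding of the sparse disparity prompt
    affinity: (B, 8, H, W) one weight per neighbour direction
    """
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
               (0, 1), (1, -1), (1, 0), (1, 1)]
    # Normalize so the neighbour weights at each pixel sum to one.
    weights = affinity / affinity.abs().sum(dim=1, keepdim=True).clamp_min(1e-8)
    for _ in range(steps):
        agg = torch.zeros_like(hidden)
        for k, (dy, dx) in enumerate(offsets):
            shifted = torch.roll(hidden, shifts=(dy, dx), dims=(2, 3))
            agg = agg + weights[:, k:k + 1] * shifted
        hidden = agg            # image borders wrap; kept simple for brevity
    return hidden
```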

3.3. Loss Functions

We use the smooth $L_1$ loss $\mathcal{L}_s$ to supervise the predicted disparity maps, as follows:
$$\mathcal{L}_s = \sum_{i=1}^{I=4} \mathrm{smooth}_{L_1}\big(d_{gt} - \hat{d}_i\big),$$
where $d_{gt}$ denotes the ground-truth disparity map, $\hat{d}_i$ represents the $i$-th predicted disparity, and $I$ denotes the total number of predicted maps. Meanwhile, we employ a cross-entropy loss $\mathcal{L}_{ce}$ to supervise the probability distributions of the cost volume, as follows:
$$\mathcal{L}_{ce} = \sum_{i=1}^{I=3} \frac{1}{N} \sum_{d} -P_{gt}(d) \log \hat{P}_i(d),$$
where $\hat{P}_i$ denotes the $i$-th predicted distribution, $P_{gt}$ is the ground-truth distribution obtained from the normalized multi-modal Laplacian distribution of $d_{gt}$, and $N$ denotes the number of pixels in the image. Thus, the total loss $\mathcal{L}_{total}$ is defined as follows:
$$\mathcal{L}_{total} = \mathcal{L}_s + \mathcal{L}_{ce}.$$
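Assuming per-pixel validity masks, the combined objective could be computed as follows (a sketch, not the authors' training code):

```python
import torch
import torch.nn.functional as F

def total_loss(disp_preds, prob_preds, d_gt, p_gt, valid):
    """disp_preds: list of 4 predicted disparity maps, each (B, H, W)
    prob_preds: list of 3 predicted distributions, each (B, D, H, W)
    p_gt:       multi-modal Laplacian target distribution (B, D, H, W)
    valid:      (B, H, W) boolean mask of pixels with ground truth
    """
    l_s = sum(F.smooth_l1_loss(d[valid], d_gt[valid]) for d in disp_preds)
    l_ce = sum(
        (-(p_gt * torch.log(p.clamp_min(1e-8))).sum(dim=1)[valid]).mean()
        for p in prob_preds
    )
    return l_s + l_ce
```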

4. Experiments

We conduct a comprehensive set of experiments to validate our proposed framework. Our evaluation is designed to answer three key questions: (1) How does our method perform against state-of-the-art techniques across standard benchmarks? (2) How do feature size and the quality of the generated sparse labels affect performance? (3) How does our method perform in the wild with a ZED 2 camera?

4.1. Datasets & Evaluation Metrics

Datasets. We briefly describe the datasets, which come from different domains. 1) Source datasets. We train all models only on the SceneFlow dataset [11] or the CreStereo dataset [10]. 2) Target domain. Following previous works [18,37], we evaluate all models without fine-tuning on KITTI 2012&2015 (KT-12 and KT-15) [12,13], ETH3D (ET) [14], and Middlebury (MB) [15]. Meanwhile, we collect stereo pairs using a ZED 2 camera to provide real-world stereo images for testing performance in the wild.
Evaluation Metrics. We evaluate our model using two standard metrics. 1) End-point error (EPE), which computes the average absolute error between the predicted disparity map and the ground truth. 2) The $\tau$-pixel error, which measures the percentage of pixels with an absolute error greater than $\tau$ pixels. The evaluation websites further report errors over different pixel regions, including all pixels, non-occluded pixels, and occluded pixels, denoted as All, Noc, and Occ, respectively.
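Both metrics reduce to a few lines; the sketch below computes them over the set of pixels with valid ground truth.

```python
import torch

def epe(pred, gt, valid):
    """Mean absolute disparity error over valid pixels."""
    return (pred - gt).abs()[valid].mean()

def tau_pixel_error(pred, gt, valid, tau=3.0):
    """Percentage of valid pixels whose absolute error exceeds tau."""
    err = (pred - gt).abs()[valid]
    return 100.0 * (err > tau).float().mean()
```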

4.2. Main Properties

The Proposed Structure. We compare the core parts of our model, especially the feature transform module (FTM) and sparse prompt module (SPM). As shown in Table 1 and Figure 4, three observations can be made: 1) compared with decoding a small-size cost volume, feature transformation is important for models based on vision foundation models; 2) consistent with previous work, multi-scale aggregation helps the model perform better; and 3) the sparse prompt module further improves matching accuracy, demonstrating the effectiveness of our architecture. In addition, we infer that the key factor is that feature transformation reconstructs scale information, which is crucial for recovering location information. Large models convert patches into tokens, thereby reducing feature resolution and weakening positional cues. From this perspective, although large models provide strong semantic features, these features do not seem to contain sufficient positional information for stereo matching. Therefore, reconstructing coordinate information is essential when using large models for position-sensitive tasks.
Sparse Disparity Map Evaluation. We evaluate the sparse disparity maps produced by our model in both in-domain and cross-domain settings, as shown in Figure 5 and Figure 6. We also report quantitative results for sparse disparity in Table 2. Three conclusions can be drawn: 1) the sparse disparity maps are of high quality, with a 3-pixel error lower than 1.50 % in the best setting; 2) compared with other uncertainty-based methods, the sparse disparity maps generated by our model are denser and more accurate; and 3) when comparing in-domain and cross-domain sparse disparity maps, the sparsity and accuracy do not degrade significantly, indicating that our model is robust. Overall, benefiting from vision foundation models, our method can generate stable sparse disparity maps to guide the sparse-to-dense process.
Different Backbone. We compare models trained with two vision foundation models, DINO v2 and DepthAnything v2, on two synthetic datasets (CreStereo and SceneFlow). As shown in Table 3, three conclusions can be drawn: 1) both vision foundation models help improve performance in in-domain and cross-domain settings; 2) compared with DINO v2, the performance gain from using DepthAnything v2 is more significant; and 3) the larger model achieves better generalization. In addition, we speculate that depth prediction and stereo matching are more closely related tasks, which is why DepthAnything v2 helps the model perform better in 3D perception. Therefore, our final model uses DepthAnything v2 as the feature extractor.

4.3. Comparisons

In-the-Wild Generalization. We employ a ZED 2 camera to capture binocular images in both indoor and outdoor environments to evaluate the model’s performance in real-world scenarios, as illustrated in Figure 7. The results demonstrate that our model can generate clear and accurate disparity maps across diverse environmental conditions. Notably, it maintains robust performance even in challenging scenarios, such as occlusions and varying lighting conditions. Thus, our model exhibits strong generalization capability and effectively adapts to different backgrounds and scene complexities without requiring additional fine-tuning. This suggests strong potential for deployment in practical applications such as autonomous navigation, robotics, and augmented reality, which is consistent with the application trend of recent Sensors studies on deployable stereo systems [8]. More results in open environments are provided in the supplementary material. Overall, our model can reliably predict high-quality disparity maps in diverse settings.
Cross-Domain Comparison. We compare our model with state-of-the-art domain-generalized stereo matching methods and report quantitative cross-domain results on four public real-world datasets. Table 4 shows that our model achieves competitive peak cross-domain performance across the four datasets. Our model achieves the best performance on ETH3D and KITTI 2012&2015, and the second-best performance on Middlebury. Meanwhile, Figure 8 presents our predicted disparity maps alongside those of different methods, indicating that our model performs better at edges and fine details. Therefore, our model demonstrates strong generalization performance.
Volatility Comparison. There are large fluctuations in the results across different training epochs. Therefore, we should pay attention not only to peak cross-domain performance but also to result stability. Table 5 indicates that our model exhibits smaller fluctuations than other methods, meaning that it is more robust. We attribute these improvements to two main factors: 1) the large model provides robust feature representations, which help the model perform more stably; and 2) as shown in Figure 6, the predicted sparse disparity map remains stable in both density and accuracy, which helps ensure effective final results. Meanwhile, compared with the sparse disparity in Figure 5, we find that, regardless of whether the setting is in-domain or cross-domain, our model can provide an accurate sparse prompt to guide disparity aggregation, indicating strong generalization ability.

5. Discussion

Although we have implemented a stereo matching model for open environments, several issues remain unresolved. We have tested the model in the wild and observed strong generalization. However, two core issues still need to be addressed. 1) The disparity range of the current model is limited, for example, to $[1, 197]$. If the target object is captured at close range, the true disparity may exceed 197. As shown in Figure 9(b), the model cannot correctly handle regions whose disparity is larger than 197. 2) If we feed two identical images (e.g., two left images) into the network, the expected output should be a zero-disparity map. However, as shown in Figure 9(f), the model still produces a disparity map with noticeable noise, suggesting that it may not fully capture the essential concept of stereo matching. Moreover, when we down-sample stereo pairs to fit the predefined disparity range, the model can recover a reasonable disparity map, as shown in Figure 9(c). We also warp the left image using the predicted disparity map to reconstruct the right image, and Figure 9(e) shows that the reconstructed right image is visually reasonable. In related tasks such as multi-view depth estimation, the search range in the cost volume can be adjusted for different depth ranges to obtain appropriate results. In our case, however, simply changing the disparity range still does not fully solve the problem. Overall, despite achieving good generalization, the network still does not seem to fully capture the essential concept of matching.
Future Work. The disparity range varies across scenes and focal lengths. Although some methods claim to identify disparity ranges dynamically, they are still limited to specific datasets. Thus, based on the work presented in this paper, we hope to further explore dynamic disparity-range perception and develop a stereo matching model that is more suitable for engineering applications. Meanwhile, we will further study how to help the model better capture abstract concepts such as three-dimensional geometry and stereo correspondence.

6. Conclusions

In this work, we propose a sparse self-prompt guided network for stereo matching. Our core innovation lies in a sparse self-prompt guidance mechanism based on vision foundation models to achieve better generalization for in-the-wild stereo matching. Meanwhile, we find that tokens from vision foundation models cannot fully preserve the coordinate information of all pixels within a patch for stereo matching. Additionally, we collect a diverse set of indoor and outdoor stereo pairs using a ZED 2 camera to assess real-world performance. Benefiting from this design, our model significantly improves zero-shot performance, including on stereo pairs collected in the wild. Extensive experiments validate the effectiveness of our approach: 1) our model achieves state-of-the-art performance in cross-domain generalization; 2) our method produces accurate sparse disparity maps in both in-domain and cross-domain settings; and 3) our in-the-wild experiments show that the method can be applied with a ZED 2 camera.

Author Contributions

Conceptualization, H.L. and Z.R.; methodology, H.L., H.M., and Z.R.; software, H.L. and H.M.; validation, H.L., X.L., and T.F.; formal analysis, H.L. and S.L.; investigation, H.L., T.F., and S.Y.; resources, Z.R.; data curation, H.L. and H.M.; writing—original draft preparation, H.L.; writing—review and editing, H.L., Z.R., and S.L.; visualization, H.L. and X.L.; supervision, Z.R.; funding acquisition, Z.R. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported in part by the National Natural Science Foundation of China (62401244, 62473187), and Jiangxi Natural Science Foundation, China (20244BCE52116, 20252BAC240225, 20252BAC200195).

Acknowledgments

The authors would like to thank the computer vision community for sharing valuable open-source implementations that supported this research.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

Appendix A. Implementation Details

In this paper, we implement our model using PyTorch. The hyperparameters and training schedule are set as follows.
Hyper-parameters. We set the maximum disparity to $D = 196$ to ensure that nearly all possible disparity values in the images can be covered. We use the Adam optimizer ($\beta_1 = 0.9$, $\beta_2 = 0.999$) to optimize the model. During training, we set a mini-batch size of 3 image pairs per GPU (18 in total on 6 GPUs).
Training schedule. We train models only on the SceneFlow or CreStereo dataset. The learning rate is initially set to $1 \times 10^{-3}$ for 50 epochs and then reduced to $1 \times 10^{-4}$ for the remaining 10 epochs. For data augmentation, we use random cropping, which randomly crops left-right images and the corresponding disparity map to $518 \times 266$.
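The stated schedule corresponds to the following optimizer setup; `build_optim` is an illustrative helper name, and the epoch loop is left to the surrounding training script.

```python
import torch

def build_optim(model):
    # Adam with the stated betas; LR drops from 1e-3 to 1e-4 after epoch 50.
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3,
                                 betas=(0.9, 0.999))
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                     milestones=[50],
                                                     gamma=0.1)
    return optimizer, scheduler   # call scheduler.step() once per epoch
```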

Appendix B. Additional Experiments

Table A1. Comparison of different pre-trained datasets. Different datasets affect cross-domain performance because of differences in disparity ranges and image styles.

Pre-trained dataset   KT-12 (>3px)   KT-15 (>3px)   MB (>2px)   ET (>1px)
SceneFlow [11]        4.6            4.7            7.9         2.5
CreStereo [10]        3.6            4.2            7.6         2.1

Different Pre-training Datasets. First, we use our model with the DepthAnything v2 feature extraction module as the baseline. Then, we train our model on two large synthetic datasets (CreStereo and SceneFlow) to compare peak cross-domain performance, as listed in Table A1. Two conclusions can be drawn: 1) the pre-training dataset affects cross-domain performance because of differences in disparity ranges and image styles; and 2) of the two datasets, CreStereo helps our model achieve better cross-domain performance, especially for in-the-wild generalization. Therefore, we use the model pre-trained on CreStereo to compare with state-of-the-art methods.
Figure A1. Example results in the source domain. Our model predicts clear disparity maps on the pre-training datasets.
Figure A2. Sparse disparity in the source domain. It shows that the model can generate a high-quality prompt in the source domain.
More Results in the Source Domain. We present the final disparity and sparse disparity results in Figure A1 and Figure A2. Our model predicts clear disparity maps and high-quality prompts in the source domain. It is worth noting that, even in the source domain, our sparse prompt performs well in challenging areas such as occluded or repetitive-texture regions.
Figure A3. Example results in the wild. We collected stereo pairs in the wild using a ZED 2 camera to test the model’s generalization. The results show that our model can obtain good disparity predictions in the wild, indicating its practical applicability.
More Results in the Wild. We show the predicted disparity maps and the reconstructed point cloud in the wild. As shown in Figure A3, our method preserves scene details more effectively. For example, the sword in the sculpture is recovered more completely in our disparity map, while it is partially missing in the results of other methods. Likewise, the disparity of the ground, which contains repetitive textures, is also better preserved by our method. These results further demonstrate the strong generalization ability of our model and its compatibility with ZED 2 cameras.

Appendix C. More Visualization Results

In the last section of the supplementary material, we provide qualitative results on in-the-wild scenes and public datasets, including KITTI 2012&2015, ETH3D, Middlebury, ZED, Dynamic, and Flickr1024, as shown from Figure A4 to the left side of Figure A11. Our model outputs sparse disparity maps with clear boundaries and high-quality dense disparity maps. We also present failure cases on the right side of Figure A11. These experimental results indicate that our method has strong generalization ability and the potential to be directly applied to binocular systems.
Figure A4. Results on the KITTI 2012 dataset.
Figure A5. Results on the KITTI 2015 dataset.
Figure A6. Results on the Middlebury dataset.
Figure A7. Results on the ETH3D dataset.
Figure A8. Comparison of results on the ZED dataset.
Figure A9. Results on the Dynamic dataset.
Figure A10. Results on the ZED dataset.
Figure A11. Results and a failure case on the Flickr1024 dataset.

References

  1. Chen, C.; Zhao, L.; He, Y.; Long, Y.; Chen, K.; Wang, Z.; Hu, Y.; Sun, X. SemStereo: Semantic-Constrained Stereo Matching Network for Remote Sensing. Proceedings of the AAAI Conference on Artificial Intelligence (AAAI) 2025, 39, 15758–15766.
  2. Rao, Z.; Dai, Y.; Shen, Z.; He, R. Rethinking Training Strategy in Stereo Matching. IEEE Transactions on Neural Networks and Learning Systems (TNNLS) 2023, 34, 7796–7809.
  3. Liang, Z.; Li, C. Any-Stereo: Arbitrary-Scale Disparity Estimation for Iterative Stereo Matching. Proceedings of the AAAI Conference on Artificial Intelligence (AAAI) 2024, 38, 3333–3341.
  4. Li, X.; Zhang, C.; Su, W.; Tao, W. II-Net: Implicit Intra-Inter Information Fusion for Real-Time Stereo Matching. Proceedings of the AAAI Conference on Artificial Intelligence (AAAI) 2024, 38, 3225–3233.
  5. Zhou, J.; Zhang, H.; Yuan, J.; Ye, P.; Chen, T.; Jiang, H.; Chen, M.; Zhang, Y. All-in-One: Transferring Vision Foundation Models into Stereo Matching. Proceedings of the AAAI Conference on Artificial Intelligence (AAAI) 2025, 39, 10797–10805.
  6. Yang, L.; Kang, B.; Huang, Z.; Xu, X.; Feng, J.; Zhao, H. Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024; pp. 10371–10381.
  7. Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.Y.; et al. Segment Anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023; pp. 4015–4026.
  8. Ruan, J.; Weng, H.; Yuan, Z.; Jin, G.; Zhou, L. Lightweight Stereo Vision for Obstacle Detection and Range Estimation in Micro-Mobility Vehicles. Sensors 2026, 26, 1988.
  9. Oquab, M.; Darcet, T.; Moutakanni, T.; Vo, H.; Szafraniec, M.; Khalidov, V.; Fernandez, P.; Haziza, D.; Massa, F.; El-Nouby, A.; et al. DINOv2: Learning Robust Visual Features without Supervision. arXiv 2023, arXiv:2304.07193.
  10. Li, J.; Wang, P.; Xiong, P.; Cai, T.; Yan, Z.; Yang, L.; Liu, J.; Fan, H.; Liu, S. Practical Stereo Matching via Cascaded Recurrent Network with Adaptive Correlation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022; pp. 16263–16272.
  11. Mayer, N.; Ilg, E.; Hausser, P.; Fischer, P.; Cremers, D.; Dosovitskiy, A.; Brox, T. A Large Dataset to Train Convolutional Networks for Disparity, Optical Flow, and Scene Flow Estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2016; pp. 4040–4048.
  12. Geiger, A.; Lenz, P.; Urtasun, R. Are We Ready for Autonomous Driving? The KITTI Vision Benchmark Suite. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2012; pp. 3354–3361.
  13. Menze, M.; Geiger, A. Object Scene Flow for Autonomous Vehicles. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2015; pp. 3061–3070.
  14. Schöps, T.; Schönberger, J.L.; Galliani, S.; Sattler, T.; Schindler, K.; Pollefeys, M.; Geiger, A. A Multi-View Stereo Benchmark with High-Resolution Images and Multi-Camera Videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2017; pp. 3260–3269.
  15. Scharstein, D.; Hirschmüller, H.; Kitajima, Y.; Krathwohl, G.; Nešić, N.; Wang, X.; Westling, P. High-Resolution Stereo Datasets with Subpixel-Accurate Ground Truth. In Proceedings of the German Conference on Pattern Recognition, 2014; pp. 31–42.
  16. Wang, Y.; Wang, L.; Yang, J.; An, W.; Guo, Y. Flickr1024: A Large-Scale Dataset for Stereo Image Super-Resolution. In Proceedings of the International Conference on Computer Vision Workshops (ICCVW), 2019; pp. 3852–3857.
  17. Kendall, A.; Martirosyan, H.; Dasgupta, S.; Henry, P.; Kennedy, R.; Bachrach, A.; Bry, A. End-to-End Learning of Geometry and Context for Deep Stereo Regression. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2017; pp. 66–75.
  18. Shen, Z.; Dai, Y.; Rao, Z. CFNet: Cascade and Fused Cost Volume for Robust Stereo Matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021; pp. 13906–13915.
  19. Zhao, H.; Zhou, H.; Zhang, Y.; Chen, J.; Yang, Y.; Zhao, Y. High-Frequency Stereo Matching Network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023; pp. 1327–1336.
  20. Guan, T.; Wang, C.; Liu, Y.H. Neural Markov Random Field for Stereo Matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024; pp. 5459–5469.
  21. Wang, X.; Xu, G.; Jia, H.; Yang, X. Selective-Stereo: Adaptive Frequency Information Selection for Stereo Matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024; pp. 19701–19710.
  22. Zhou, J.; Ye, P.; Zhang, H.; Yuan, J.; Qiang, R.; YangChenXu, L.; Cailin, W.; Xu, F.; Chen, T. Consistency-Aware Self-Training for Iterative-Based Stereo Matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025; pp. 16641–16650.
  23. Zhang, J.; Wang, X.; Bai, X.; Wang, C.; Huang, L.; Chen, Y.; Gu, L.; Zhou, J.; Harada, T.; Hancock, E.R. Revisiting Domain Generalized Stereo Matching Networks from a Feature Consistency Perspective. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022; pp. 13001–13011.
  24. Rao, Z.; Xiong, B.; He, M.; Dai, Y.; He, R.; Shen, Z.; Li, X. Masked Representation Learning for Domain Generalized Stereo Matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023; pp. 5435–5444.
  25. Zhang, Y.; Wang, L.; Li, K.; Wang, Y.; Guo, Y. Learning Representations from Foundation Models for Domain Generalized Stereo Matching. In Proceedings of the European Conference on Computer Vision (ECCV), 2024; pp. 146–162.
  26. Wen, B.; Trepte, M.; Aribido, J.; Kautz, J.; Gallo, O.; Birchfield, S. FoundationStereo: Zero-Shot Stereo Matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025; pp. 5249–5260.
  27. Bartolomei, L.; Tosi, F.; Poggi, M.; Mattoccia, S. Stereo Anywhere: Robust Zero-Shot Deep Stereo Matching Even Where Either Stereo or Mono Fail. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025; pp. 1013–1027.
  28. Jiang, H.; Lou, Z.; Ding, L.; Xu, R.; Tan, M.; Jiang, W.; Huang, R. DEFOM-Stereo: Depth Foundation Model Based Stereo Matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025; pp. 21857–21867.
  29. Cheng, J.; Liu, L.; Xu, G.; Wang, X.; Zhang, Z.; Deng, Y.; Zang, J.; Chen, Y.; Cai, Z.; Yang, X. MonSter: Marrying Monodepth to Stereo Unleashes Power. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025; pp. 6273–6282.
  30. Jia, M.; Tang, L.; Chen, B.C.; Cardie, C.; Belongie, S.; Hariharan, B.; Lim, S.N. Visual Prompt Tuning. In Proceedings of the European Conference on Computer Vision (ECCV), 2022; pp. 709–727.
  31. Shen, Z.; Song, X.; Dai, Y.; Zhou, D.; Rao, Z.; Zhang, L. Digging into Uncertainty-Based Pseudo-Labels for Robust Stereo Matching. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 2023, 45, 14301–14320.
  32. Yang, S.; Wu, J.; Liu, J.; Li, X.; Zhang, Q.; Pan, M.; Gan, Y.; Chen, Z.; Zhang, S. Exploring Sparse Visual Prompt for Domain Adaptive Dense Prediction. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2024; pp. 16334–16342.
  33. Guo, Y.; Yang, C.; Rao, A.; Agrawala, M.; Lin, D.; Dai, B. SparseCtrl: Adding Sparse Controls to Text-to-Video Diffusion Models. In Proceedings of the European Conference on Computer Vision (ECCV), 2024; pp. 330–348.
  34. Li, H.; Liu, H.; Hu, D.; Wang, J.; Oguz, I. PRISM: A Promptable and Robust Interactive Segmentation Model with Visual Prompts. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), 2024; pp. 389–399.
  35. Huang, Z.; Yu, H.; Shentu, Y.; Yuan, J.; Zhang, G. From Sparse to Dense: Camera Relocalization with a Scene-Specific Detector from Feature Gaussian Splatting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025; pp. 27059–27069.
  36. He, K.; Chen, X.; Xie, S.; Li, Y.; Dollár, P.; Girshick, R. Masked Autoencoders Are Scalable Vision Learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022; pp. 16000–16009.
  37. Liu, B.; Yu, H.; Qi, G. GraftNet: Towards Domain Generalized Stereo Matching with a Broad-Spectrum and Task-Oriented Feature. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022; pp. 13012–13021.
  38. Yang, L.; Kang, B.; Huang, Z.; Zhao, Z.; Xu, X.; Feng, J.; Zhao, H. Depth Anything V2. Advances in Neural Information Processing Systems (NeurIPS) 2024, 37, 21875–21911.
  39. Liu, Z.; Qiao, L.; Chu, X.; Ma, L.; Jiang, T. Towards an Efficient Foundation Model for Zero-Shot Amodal Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025; pp. 20254–20264.
  40. Liang, Y.; Hu, Y.; Shao, W.; Fu, Y. Distilling Monocular Foundation Models for Fine-Grained Depth Completion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025; pp. 22254–22265.
  41. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale. In Proceedings of the International Conference on Learning Representations (ICLR), 2021; pp. 1–21.
  42. Liu, B.; Yu, H.; Long, Y. Local Similarity Pattern and Cost Self-Reassembling for Deep Stereo Matching Networks. Proceedings of the AAAI Conference on Artificial Intelligence (AAAI) 2022, 36, 1647–1655.
  43. Xu, P.; Xiang, Z.; Qiao, C.; Fu, J.; Pu, T. Adaptive Multi-Modal Cross-Entropy Loss for Stereo Matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024; pp. 5135–5144.
Figure 1. Example predictions on in-the-wild stereo pairs. We show stereo pairs, sparse self-prompts, disparity maps, and reconstructed 3D models, demonstrating that our model predicts high-quality disparities and generalizes well to in-the-wild stereo pairs. More results on in-the-wild scenes and public datasets are provided in the supplementary materials.
Figure 2. Pipeline of the proposed method. We first use stereo features extracted by the foundation model to generate a sparse disparity prompt. Then, we use this prompt to guide the sparse prompt module, achieving sparse-to-dense disparity prediction.
Figure 3. Two ways to use tokens from foundation models. One way is to up-sample the tokens before cost aggregation. The other is to up-sample the cost volume after cost aggregation.
Figure 4. Example results in the source domain. Our model predicts clear disparity maps on the pre-training datasets.
Figure 5. Sparse disparity in the source domain. It shows that the model can generate a high-quality prompt in the source domain.
Figure 6. Sparse disparity in the cross-domain setting. It shows that the model can generate a high-quality prompt in the cross-domain setting.
Figure 7. Example results in the wild. We collected stereo pairs in the wild using a ZED 2 camera to test the model’s generalization. The results show that our model can obtain good disparity predictions in the wild, indicating its practical applicability.
Figure 8. Predicted disparity in the cross-domain setting. It shows that our model predicts a dense disparity map in the cross-domain setting.
Figure 9. Limitations of the proposed model. It shows that: 1) our model cannot handle stereo pairs with a very large disparity range; and 2) when identical images are fed into the model, the output still contains noise.
Table 1. Ablation study of our model. It shows that our model improves performance in both in-domain and cross-domain settings.
Table 2. Accuracy under different confidence thresholds. It shows that the model predicts accurate sparse disparity maps.

t      Cre (>3px)   SF (>3px)   KT-12 (>3px)   KT-15 (>3px)   MB (>2px)   ET (>1px)
0.08   1.9          2.1         4.6            4.0            8.7         2.7
0.10   1.6          1.9         3.7            3.4            8.2         2.5
0.12   1.4          1.6         2.8            2.7            8.1         2.3
0.15   1.2          1.5         1.6            1.8            7.8         1.7
Table 3. Comparison of different backbones. It indicates that DepthAnything v2 is more suitable for stereo matching tasks.

Backbone           Size   SF (>3px)   KT-12 (>3px)
CNN                -      3.90        4.98
DINO v2            S      3.65        4.84
DINO v2            L      3.22        4.54
DepthAnything v2   S      3.61        3.79
DepthAnything v2   L      3.20        3.62
Table 4. Peak cross-domain generalization evaluation. It shows that our model achieves state-of-the-art cross-domain performance.

Method            KT-12 (>3px)   KT-15 (>3px)   MB (>2px)   ET (>1px)
HVT-RAFT          3.7            5.2            10.4        3.0
NMRF              4.2            4.9            7.5         3.8
Selective-IGEV    4.5            5.6            9.2         5.7
Former-RAFT-DAM   3.9            5.1            8.1         3.3
DEFOM-Stereo      3.7            4.9            11.9        2.3
CST-IGEV          4.1            5.2            9.8         4.2
SSPGNet           3.6            4.4            7.6         2.1
Table 5. Volatility evaluation of cross-domain generalization. It shows that our model is more stable across different epochs.

Method       KT-12 (>3px)    KT-15 (>3px)    MB (>2px)       ET (>1px)
PSMNet       10.4 ± 3.50     14.70 ± 1.03    22.40 ± 7.54    17.60 ± 1.46
GANet        8.90 ± 0.75     12.10 ± 1.19    19.10 ± 9.24    12.10 ± 1.23
DSMNet       5.90 ± 1.14     6.60 ± 0.18     19.70 ± 3.64    7.12 ± 1.65
GF-PSMNet    6.00 ± 0.89     5.70 ± 0.58     17.80 ± 1.87    12.30 ± 1.59
LacGwcNet    9.17 ± 10.46    8.37 ± 9.24     18.28 ± 0.51    7.99 ± 1.37
CFNet        5.82 ± 0.13     6.56 ± 0.19     15.16 ± 0.90    7.24 ± 0.30
Mask-CFNet   5.03 ± 0.03     6.08 ± 0.07     12.82 ± 0.37    6.63 ± 0.21
SSPGNet      3.65 ± 0.01     4.45 ± 0.03     7.66 ± 0.05     2.29 ± 0.20