Submitted: 03 June 2026
Posted: 04 June 2026
Abstract
Keywords:
1. Introduction
2. Materials and Methods
2.1. Principles and Applications of SAM
2.2. Principles and Applications of FoundationPose
2.3. SAM-FoundationPose Closed-Loop Fusion Framework
2.3.1. Overall Framework Design
2.3.2. Dynamic Prompt-Based SAM Segmentation Module
2.3.3. FoundationPose Pose Estimation Module
2.3.4. Multi-Dimensional Confidence Assessment Module
2.3.5. Iterative Optimization Control
3. Experiments and Analysis
3.1. Platform Setup and Experimental Preparation
3.2. Object Pose Estimation in Cluttered but Non-Stacked Scenes
3.2.1. Robustness Analysis
3.2.2. Positioning Accuracy Analysis
3.3. Object Pose Estimation in Cluttered and Stacked Scenes
3.3.1. Closed-Loop Iterative Correction Verification
3.3.2. Closed-Loop Iterative Correction Verification
3.4. Comparative Experiments with Different Estimation Methods
3.4.1. Selection of Comparative Algorithms and Evaluation Metrics
3.4.2. Result Comparison and Analysis
3.4.3. Visual Comparison of Results Across Different Methods
3.5. Core Module Ablation Experiments
3.5.1. Ablation Variant Design
- (1) Variant A (FoundationPose): Only the basic FoundationPose network is used without introducing any additional modules. This variant serves as the performance baseline, reflecting the inherent performance upper bound of the basic method under complex stacking backgrounds and providing a unified reference standard for all subsequent comparisons.
- (2) Variant B (FoundationPose+SAM): Based on Variant A, the SAM segmentation module is introduced solely at the input end to provide an initial foreground mask, while maintaining an open-loop process without subsequent iterative optimization. This variant is used to verify the suppression effect of the SAM segmentation module on background interference.
- (3) Variant C (FoundationPose+SAM+Iterative closed-loop): Based on Variant B, a closed-loop iterative mechanism is introduced; however, only a single "mask matching degree (IoU)" metric is used for confidence assessment, without adopting the multi-dimensional evaluation system proposed in this paper. This variant is used to verify the optimization capability of the closed-loop iterative mechanism and to compare the limitations imposed by single-metric evaluation on the iterative process.
- (4) Variant D (FoundationPose+SAM+Iterative closed-loop+Multi-dimensional confidence assessment): The complete framework proposed in this paper, incorporating SAM segmentation, rendering-guided closed-loop iterative feedback, and multi-dimensional confidence assessment. This variant is used to verify the overall performance under the synergistic cooperation of all modules and to demonstrate the gain effect of multi-dimensional assessment on the closed-loop iteration.
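Variant C relies on a single mask matching degree (IoU) as its confidence signal. A minimal sketch of how such a score could be computed is given here, assuming the segmentation mask and the mask rendered from the estimated pose are binary grids of the same size. The function name `mask_iou` and the nested-list mask representation are illustrative assumptions, not taken from the paper.

```python
from typing import Sequence

# Hypothetical mask representation: rows of 0/1 values.
Mask = Sequence[Sequence[int]]


def mask_iou(seg_mask: Mask, rendered_mask: Mask) -> float:
    """Intersection over union of two same-sized binary masks."""
    if len(seg_mask) != len(rendered_mask):
        raise ValueError("masks must have the same height")
    intersection = 0
    union = 0
    for row_a, row_b in zip(seg_mask, rendered_mask):
        if len(row_a) != len(row_b):
            raise ValueError("masks must have the same width")
        for a, b in zip(row_a, row_b):
            intersection += 1 if (a and b) else 0
            union += 1 if (a or b) else 0
    # Two empty masks give no overlap to measure; treat that as zero agreement.
    return intersection / union if union else 0.0


# Toy 3x3 example: 3 shared pixels, 5 pixels in the union.
seg = [[0, 1, 1],
       [0, 1, 1],
       [0, 0, 0]]
rendered = [[0, 0, 1],
            [0, 1, 1],
            [0, 0, 1]]
print(round(mask_iou(seg, rendered), 3))  # 0.6
```

A score of this kind only measures 2D silhouette agreement, which is consistent with the stated motivation for comparing Variant C against the multi-dimensional assessment in Variant D.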
3.5.2. Ablation Experiment Results and Analysis
3.5.3. Ablation Experiment Conclusion
3.6. Verification of Robotic Arm Visual Grasping in Real-World Scenarios
3.6.1. Grasping Experiment Procedure
3.6.2. Grasping Result Analysis
4. Discussion
4.1. Innovations of This Work
4.2. Limitations of This Work
- (1) High-reflectivity interference: The earphone case surface exhibits multi-source specular reflections, and the corn bottle exhibits complex internal refractive textures. This optical noise can disturb the infrared projection of the depth camera, causing partial loss of the point cloud or severe depth distortion, which in turn leads to erroneous judgments by the 3D geometric confidence assessment module.
- (2) Low edge contrast: Objects such as the earphone case have smooth, reflective surfaces with low-contrast edges. In such cluttered backgrounds, although SAM possesses powerful semantic segmentation capability, its response to the edges of transparent materials is extremely weak. As a result, the prior mask extracted at the front end lacks critical topological features, which ultimately causes minor physical collisions or slippage of the robotic arm’s end effector during closure.
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest

Input: RGB-D image I, 3D model M_3d, camera intrinsics K
Output: Optimal 6D pose T*
Parameters: τ ← 85 (confidence threshold), k_max ← 10 (maximum iterations)

k ← 0
// Initialization phase
bbox ← YOLOv5n(I) ▷ Heuristic bounding box prompt
M_0 ← SAM(I, bbox) ▷ Initial foreground mask
T_0 ← FoundationPose(I, M_0, M_3d) ▷ Initial 6D pose estimation
S_0 ← ComputeScore(T_0, M_0, I, M_3d, K) ▷ Multi-dimensional confidence score
// Iterative refinement phase
while S_k < τ and k < k_max do
    k ← k + 1
    M_guide ← RenderMask(T_{k−1}, M_3d, K) ▷ Render geometric prior mask
    M_k ← SAM(I, M_guide) ▷ Mask refinement with spatial prior
    T_k ← FoundationPose(I, M_k, M_3d) ▷ Pose re-estimation with refined mask
    S_k ← ComputeScore(T_k, M_k, I, M_3d, K) ▷ Update confidence score
end while
return T* ← T_k
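The control flow of the algorithm above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the detector, segmenter, pose estimator, renderer, and scorer are passed in as plain callables, and the toy stand-ins below are hypothetical placeholders rather than the real YOLOv5n, SAM, or FoundationPose interfaces. Returning the final score and iteration count alongside the pose is also an addition for illustration.

```python
from typing import Any, Callable, Tuple


def closed_loop_pose(
    image: Any,
    model_3d: Any,
    intrinsics: Any,
    detect_bbox: Callable[[Any], Any],
    segment: Callable[[Any, Any], Any],
    estimate_pose: Callable[[Any, Any, Any], Any],
    render_mask: Callable[[Any, Any, Any], Any],
    compute_score: Callable[[Any, Any, Any, Any, Any], float],
    tau: float = 85.0,
    k_max: int = 10,
) -> Tuple[Any, float, int]:
    """Refine the pose until the score reaches tau or k_max iterations pass."""
    # Initialization phase: box prompt -> mask -> pose -> confidence score.
    bbox = detect_bbox(image)
    mask = segment(image, bbox)
    pose = estimate_pose(image, mask, model_3d)
    score = compute_score(pose, mask, image, model_3d, intrinsics)

    # Iterative refinement phase: the rendered mask becomes the next prompt.
    k = 0
    while score < tau and k < k_max:
        k += 1
        guide = render_mask(pose, model_3d, intrinsics)
        mask = segment(image, guide)
        pose = estimate_pose(image, mask, model_3d)
        score = compute_score(pose, mask, image, model_3d, intrinsics)
    return pose, score, k


# Toy stand-ins (hypothetical): the "pose" is a scalar error that halves each
# time the rendered guide is fed back, and the score is 100 minus that error.
def toy_detect(image):
    return "bbox"


def toy_segment(image, prompt):
    return prompt


def toy_estimate(image, mask, model_3d):
    return 40.0 if mask == "bbox" else mask


def toy_render(pose, model_3d, intrinsics):
    return pose / 2.0


def toy_score(pose, mask, image, model_3d, intrinsics):
    return 100.0 - pose


result = closed_loop_pose("rgbd", "cad", "K", toy_detect, toy_segment,
                          toy_estimate, toy_render, toy_score)
print(result)  # (10.0, 90.0, 2)
```

In the toy run the score goes 60, 80, 90, so the loop stops after two refinements once the threshold of 85 is crossed. Like the pseudocode, the sketch returns the pose from the last iteration rather than tracking the best-scoring one.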
| Target Object (Category) | Object Type Characteristics | ADD(-S) Recall Rate (%) | Average Translation Error (mm) | Average Rotation Error (°) |
| --- | --- | --- | --- | --- |
| Headphones | Heterogeneous shape / background texture | 98.2 | 3.2 | 1.9 |
| Tape measure | Same-color background confusion | 97.5 | 4.1 | 2.3 |
| Detergent bottle | Cylindrical symmetry / high-frequency reflections | 96.8 | 3.8 | 2.1 |
| Adhesive tape | Low-contrast edges | 94.1 | 5.2 | 3.4 |
| Corn bottle | Complex internal texture | 98.0 | 2.9 | 1.7 |
| Ping-pong ball | Textureless solid-color sphere | 99.2 | 1.8 | 1.2 |
| Computer mouse | Weak texture / dark color | 98.5 | 2.6 | 1.8 |
| Small tire | Strong geometric jagged texture | 97.4 | 3.5 | 2.0 |
| Pliers | Slender irregular structure | 96.1 | 4.4 | 2.6 |
| Earphone case | Multi-source lighting / specular reflections | 95.3 | 4.8 | 3.1 |
| Mean | - | 97.1 | 3.63 | 2.21 |
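The recall column above is based on the ADD(-S) metric, which conventionally means ADD for asymmetric objects and ADD-S for symmetric ones. A minimal sketch of both distances and of a recall computation is given here, assuming the object model is a small list of 3D points in millimetres. The function names and the recall threshold are illustrative assumptions; the threshold actually used for the table is not stated in this section.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]
Rotation = Sequence[Sequence[float]]  # 3x3 rotation matrix, row-major


def transform(points: Sequence[Point], rotation: Rotation,
              translation: Sequence[float]) -> List[Point]:
    """Apply x -> R x + t to every model point."""
    return [
        tuple(sum(rotation[r][c] * p[c] for c in range(3)) + translation[r]
              for r in range(3))
        for p in points
    ]


def add_metric(gt_pts: Sequence[Point], est_pts: Sequence[Point]) -> float:
    """ADD: mean distance between corresponding model points."""
    return sum(math.dist(g, e) for g, e in zip(gt_pts, est_pts)) / len(gt_pts)


def add_s_metric(gt_pts: Sequence[Point], est_pts: Sequence[Point]) -> float:
    """ADD-S: mean distance from each point to its closest estimated point."""
    return sum(min(math.dist(g, e) for e in est_pts) for g in gt_pts) / len(gt_pts)


def recall(errors: Sequence[float], threshold: float) -> float:
    """Fraction of samples whose error falls below the threshold."""
    return sum(1 for e in errors if e < threshold) / len(errors)


# Four model points with 90-degree rotational symmetry about the z axis.
model = [(1.0, 0.0, 0.0), (-1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, -1.0, 0.0)]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
rot_z_90 = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]

gt = transform(model, identity, (0.0, 0.0, 0.0))
est = transform(model, rot_z_90, (0.0, 0.0, 3.0))  # symmetric turn plus 3 mm offset

add_value = add_metric(gt, est)
add_s_value = add_s_metric(gt, est)
print(round(add_value, 3), round(add_s_value, 3))  # 3.317 3.0
```

In this toy case ADD penalizes the 90-degree turn even though the rotated point set looks identical, while ADD-S reports only the 3 mm offset. This is why symmetric objects such as the detergent bottle or the ping-pong ball are usually scored with ADD-S.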
| Method | Input Data Type | ADD-S Recall Rate (%) | Average Translation Error (mm) | Average Rotation Error (°) |
| --- | --- | --- | --- | --- |
| PoseCNN | RGB | 45.2 | 35.6 | 22.4 |
| MegaPose | RGB-D | 71.4 | 19.5 | 11.2 |
| FoundationPose | RGB-D | 79.6 | 14.2 | 8.7 |
| Ours (SAM-FPose) | RGB-D | 91.7 | 3.5 | 2.1 |

| Experimental Variant | SAM Segmentation | Closed-Loop Iteration | Multi-Dimensional Assessment | ADD-S Recall Rate (%) | Average Translation Error (mm) |
| --- | --- | --- | --- | --- | --- |
| A | × | × | × | 72.4 | 16.8 |
| B | √ | × | × | 81.5 | 11.2 |
| C | √ | √ | × | 86.3 | 6.5 |
| D (Ours) | √ | √ | √ | 91.7 | 3.5 |
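The contribution of each module can be read off the ablation table by differencing consecutive variants. The short sketch below does that arithmetic using only the values in the table; the helper name `stepwise_gains` is illustrative.

```python
from typing import List, Tuple

# (variant, ADD-S recall in %, average translation error in mm), from the table.
ablation = [
    ("A", 72.4, 16.8),
    ("B", 81.5, 11.2),
    ("C", 86.3, 6.5),
    ("D", 91.7, 3.5),
]


def stepwise_gains(rows: List[Tuple[str, float, float]]) -> List[Tuple[str, float, float]]:
    """Recall gain and translation-error reduction between consecutive variants."""
    gains = []
    for (prev_name, prev_recall, prev_err), (name, rec, err) in zip(rows, rows[1:]):
        gains.append((
            f"{prev_name}->{name}",
            round(rec - prev_recall, 1),   # percentage points gained
            round(prev_err - err, 1),      # millimetres of error removed
        ))
    return gains


for step, recall_gain, error_drop in stepwise_gains(ablation):
    print(f"{step}: +{recall_gain} pts recall, -{error_drop} mm error")
# A->B: +9.1 pts recall, -5.6 mm error
# B->C: +4.8 pts recall, -4.7 mm error
# C->D: +5.4 pts recall, -3.0 mm error
```

Read this way, adding SAM segmentation accounts for the largest single recall gain (9.1 points), and the three steps together raise recall by 19.3 points and cut translation error by 13.3 mm relative to Variant A.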
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.