Submitted: 04 February 2026
Posted: 05 February 2026
Abstract
Keywords:
1. Introduction

- We propose TempCo-Painter, a novel video inpainting framework featuring an Adaptive Diffusion Transformer (ADiT) that combines hierarchical spatial-temporal attention, motion-guided attention, and dynamic mask awareness for enhanced spatio-temporal consistency (a minimal attention sketch follows this list).
- We introduce an enhanced MultiDiffusion strategy for efficient and consistent long video inpainting, leveraging an adaptive sliding window and a temporal smoothing regularization term to maintain global coherence across extended sequences (a window-fusion sketch follows this list).
- We demonstrate state-of-the-art performance of TempCo-Painter across various video inpainting tasks, achieving superior quantitative metrics (e.g., PSNR, SSIM, and significantly lower VFID) and qualitative results on both short and challenging long video datasets, particularly excelling in temporal consistency and efficiency.
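Since the body of Section 3.3 is not reproduced in this extract, the exact ADiT formulation is unknown. The sketch below is one plausible PyTorch-style reading of hierarchical spatial-temporal attention: per-frame spatial attention followed by per-location temporal attention applied over several window sizes, so short windows capture local motion while long windows enforce clip-level coherence. The class name, window sizes, and factorization are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HierarchicalSTAttention(nn.Module):
    """Illustrative ADiT-style block: spatial attention per frame, then
    temporal attention per spatial location at several window sizes."""

    def __init__(self, dim: int, num_heads: int = 8, temporal_windows=(2, 4, 8)):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # One temporal attention per scale: short windows model local motion,
        # long windows enforce coherence across the whole clip.
        self.temporal_attns = nn.ModuleList(
            nn.MultiheadAttention(dim, num_heads, batch_first=True)
            for _ in temporal_windows
        )
        self.temporal_windows = temporal_windows
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, H, W, C = x.shape
        # Spatial attention within each frame.
        s = self.norm1(x).reshape(B * T, H * W, C)
        s, _ = self.spatial_attn(s, s, s, need_weights=False)
        x = x + s.reshape(B, T, H, W, C)
        # Temporal attention along each spatial location, at every scale.
        t = self.norm2(x).permute(0, 2, 3, 1, 4).reshape(B * H * W, T, C)
        out = torch.zeros_like(t)
        for win, attn in zip(self.temporal_windows, self.temporal_attns):
            win = min(win, T)
            pad = (-T) % win                  # right-pad so windows tile time
            tp = F.pad(t, (0, 0, 0, pad)).reshape(-1, win, C)
            tw, _ = attn(tp, tp, tp, need_weights=False)
            out = out + tw.reshape(B * H * W, T + pad, C)[:, :T]
        return x + out.reshape(B, H, W, T, C).permute(0, 3, 1, 2, 4)
```

A block like this would be stacked inside the transformer; for example, `HierarchicalSTAttention(dim=1024)(torch.randn(1, 16, 8, 8, 1024))` returns a tensor of the same `(B, T, H, W, C)` shape.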
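Likewise, Sections 3.5.1–3.5.2 appear here as headings only, so the following shows just the generic MultiDiffusion idea transposed to time: denoise overlapping temporal windows, fuse them with border-down-weighted averaging, and apply a simple neighbor-averaging smoothing pass. The paper's adaptive strategy would additionally vary `window`/`stride` (e.g., with scene motion), which is omitted; `denoise_fn` and all parameters are placeholders.

```python
import torch

def fuse_windows(latents, denoise_fn, window=16, stride=8, smooth_weight=0.1):
    """latents: (T, C, H, W) noisy latents for the full video.
    denoise_fn: runs one denoising pass on a short clip of the same shape."""
    T = latents.shape[0]
    acc = torch.zeros_like(latents)
    cover = torch.zeros(T, 1, 1, 1, dtype=latents.dtype, device=latents.device)
    # Triangular per-frame weights down-weight a window's border frames, so a
    # frame near a seam is dominated by the window that sees it mid-clip.
    ramp = torch.minimum(torch.arange(1, window + 1), torch.arange(window, 0, -1))
    ramp = (ramp / ramp.max()).to(latents).view(window, 1, 1, 1)
    starts = list(range(0, max(T - window, 0) + 1, stride))
    if starts[-1] != max(T - window, 0):
        starts.append(max(T - window, 0))     # ensure the tail frames are covered
    for start in starts:
        clip = latents[start:start + window]
        out = denoise_fn(clip)                # one denoising pass per window
        acc[start:start + window] += ramp[: len(clip)] * out
        cover[start:start + window] += ramp[: len(clip)]
    fused = acc / cover.clamp_min(1e-8)
    # Temporal smoothing: pull each interior frame toward its neighbors' mean
    # to suppress residual flicker at window seams.
    nb = 0.5 * (fused.roll(1, dims=0) + fused.roll(-1, dims=0))
    fused[1:-1] = (1 - smooth_weight) * fused[1:-1] + smooth_weight * nb[1:-1]
    return fused
```

The weighted averaging is what lets windows overlap without hard seams; the final smoothing term is a cheap stand-in for the regularization the paper applies during denoising rather than after it.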
2. Related Work
2.1. Video Inpainting and Completion
2.2. Diffusion Models and Transformers for Video Generation
3. Method
3.1. Overview of TempCo-Painter
3.2. 3D-VAE Encoding and Latent Space Representation
3.3. Adaptive Diffusion Transformer (ADiT)
3.3.1. Hierarchical Spatial-Temporal Attention
3.3.2. Motion-Guided Attention Mechanism
3.3.3. Dynamic Mask Awareness
3.4. Flow Matching Scheduler and Efficient Inference
3.5. Enhanced MultiDiffusion for Long Video Processing
3.5.1. Adaptive Sliding Window Strategy
3.5.2. Temporal Smoothing Regularization
3.6. Training Objective
4. Experiments
4.1. Experimental Setup
4.2. Evaluation Metrics
- PSNR (Peak Signal-to-Noise Ratio) (↑): Measures the pixel-wise accuracy of the inpainted content compared to the ground truth. Higher values indicate better quality.
- SSIM (Structural Similarity Index Measure) (↑): Assesses the structural similarity between the inpainted and ground truth videos, considering luminance, contrast, and structure. Higher values indicate better structural preservation.
- VFID (Video Fréchet Inception Distance) (↓): Evaluates the overall quality and perceptual realism of the generated video by comparing its feature distribution to that of real videos. A lower VFID score signifies superior video quality and, crucially, better spatio-temporal consistency, as it captures flickering and unnatural temporal dynamics (a minimal metric implementation follows this list).
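A minimal implementation of the first two metrics, assuming videos as float arrays of shape `(T, H, W, C)` in `[0, 1]`. This follows the standard definitions; the paper's exact protocol (full-frame vs. hole-region-only, per-frame vs. per-video averaging) is not stated in this extract, so treat it as a sketch.

```python
import numpy as np
from skimage.metrics import structural_similarity

def video_psnr(pred: np.ndarray, gt: np.ndarray, max_val: float = 1.0) -> float:
    # PSNR = 10 * log10(MAX^2 / MSE), computed over the whole clip.
    mse = np.mean((pred - gt) ** 2)
    if mse == 0:
        return float("inf")                   # identical videos
    return 10.0 * np.log10(max_val**2 / mse)

def video_ssim(pred: np.ndarray, gt: np.ndarray, max_val: float = 1.0) -> float:
    # SSIM is computed per frame and averaged over the clip.
    return float(np.mean([
        structural_similarity(p, g, channel_axis=-1, data_range=max_val)
        for p, g in zip(pred, gt)
    ]))
```

VFID is omitted here because it requires a pretrained video backbone (commonly I3D) to embed whole clips, then computes the Fréchet distance between Gaussian fits of real and generated feature distributions; embedding clips rather than frames is why it penalizes flicker that per-frame metrics miss.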
4.3. Comparison with State-of-the-Art Methods
- At 4 denoising steps, TempCo-Painter (4 steps) achieves a PSNR of 34.92 and an SSIM of 0.9846, both slightly surpassing DiTPainter (4 steps). Crucially, its VFID of 0.054 is lower than DiTPainter's 0.056, indicating enhanced visual realism and temporal consistency even with fewer inference steps. This highlights our method's efficiency in generating high-quality repairs (a few-step sampler sketch follows this list).
- The gap widens at 8 denoising steps, where TempCo-Painter (8 steps) achieves the lowest VFID of 0.049 among all compared methods, improving on DiTPainter (8 steps) at 0.051. This supports the claim that the proposed hierarchical spatial-temporal attention and motion-guided attention maintain spatio-temporal consistency throughout the deeper denoising process: a lower VFID means the generated content's distribution is closer to that of real videos, with less flickering, smoother motion, and more photorealistic results.
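Section 3.4 names a flow matching scheduler, and the step counts above map one-to-one to network evaluations. The sketch below is a generic Euler sampler for a flow-matching model, not the paper's scheduler: `velocity_fn` is a placeholder for the ADiT velocity prediction, and the time grid and integration direction may differ from the authors' choices.

```python
import torch

@torch.no_grad()
def flow_matching_sample(velocity_fn, shape, steps=4, device="cpu"):
    # Each loop iteration is one network evaluation, so "4 steps" vs.
    # "8 steps" is literally 4 vs. 8 forward passes through the model.
    x = torch.randn(shape, device=device)     # start from pure noise at t = 0
    ts = torch.linspace(0.0, 1.0, steps + 1, device=device)
    for t0, t1 in zip(ts[:-1], ts[1:]):
        v = velocity_fn(x, t0)                # predicted velocity field at t0
        x = x + (t1 - t0) * v                 # Euler step toward the data
    return x
```

Because the straightened flow-matching trajectory tolerates coarse integration, a few Euler steps can already land near the data manifold, which is what makes the 4-step variant competitive.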
4.4. Ablation Study
- Adding Hierarchical Spatial-Temporal Attention notably improves the VFID from 0.058 to 0.055, alongside gains in PSNR and SSIM. This highlights the importance of multi-scale temporal modeling for reducing flickering and enhancing global coherence.
- Incorporating the Motion-Guided Attention further boosts performance, reducing VFID to 0.052. This validates its role in accurately predicting motion trajectories and textures in dynamic occluded regions, leading to smoother and more plausible animations.
- The introduction of Dynamic Mask Awareness provides another incremental improvement, bringing the VFID down to 0.051. This component enhances the model's robustness to diverse and complex mask patterns, ensuring consistent repair even with moving or irregular occlusions (a mask-conditioning sketch follows this list).
- The full TempCo-Painter model, combining all proposed components, achieves the best results with a VFID of 0.049, indicating that they contribute synergistically to spatio-temporal consistency and overall inpainting quality. While the benefit of Enhanced MultiDiffusion is most pronounced on long videos, its design principles (adaptability, smoothing) also show up in the robust short-video results by stabilizing generation within the fixed window.
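How Dynamic Mask Awareness conditions the model is not specified in this extract. One common way to make attention mask-aware, shown below purely as an illustrative guess, is to downsample the (possibly moving) mask to the latent grid and inject it as an additive attention bias; `mask_attention_bias` and its `penalty` parameter are hypothetical names, not the paper's API.

```python
import torch
import torch.nn.functional as F

def mask_attention_bias(mask, latent_hw, penalty=-4.0):
    """mask: (B, T, 1, H, W) with 1 = hole; returns a (B, T*h*w) per-key bias."""
    B, T = mask.shape[:2]
    h, w = latent_hw
    # Downsample the pixel-space mask to the latent token grid.
    m = F.interpolate(mask.flatten(0, 1), size=(h, w), mode="area")  # (B*T, 1, h, w)
    m = m.reshape(B, T * h * w)
    # Keys inside the hole carry little reliable information early in
    # denoising, so they are down-weighted for every query.
    return penalty * m
```

The returned bias would be reshaped to `(B, 1, 1, K)` and added to the attention logits before the softmax, so tokens in masked regions attend preferentially to valid context.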
4.5. Human Evaluation
4.6. Long Video Inpainting Performance
4.7. Inference Efficiency Analysis
4.8. Robustness to Challenging Mask Scenarios
5. Conclusions
References
- Xu, H.; Yan, M.; Li, C.; Bi, B.; Huang, S.; Xiao, W.; Huang, F. E2E-VLP: End-to-End Vision-Language Pre-training Enhanced by Visual Learning. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, 2021; Volume 1, pp. 503–513.
- Xu, H.; Ghosh, G.; Huang, P.Y.; Arora, P.; Aminzadeh, M.; Feichtenhofer, C.; Metze, F.; Zettlemoyer, L. VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, 2021; pp. 4227–4239.
- Xu, H.; Ghosh, G.; Huang, P.Y.; Okhonko, D.; Aghajanyan, A.; Metze, F.; Zettlemoyer, L.; Feichtenhofer, C. VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021; pp. 6787–6800.
- Zhang, X.; Li, R.; Yu, J.; Xu, Y.; Li, W.; Zhang, J. EditGuard: Versatile Image Watermarking for Tamper Localization and Copyright Protection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024; pp. 11964–11974.
- Zhang, X.; Tang, Z.; Xu, Z.; Li, R.; Xu, Y.; Chen, B.; Gao, F.; Zhang, J. OmniGuard: Hybrid Manipulation Localization via Augmented Versatile Deep Image Watermarking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025; pp. 3008–3018.
- Xu, Z.; Zhang, X.; Li, R.; Tang, Z.; Huang, Q.; Zhang, J. FakeShield: Explainable Image Forgery Detection and Localization via Multi-modal Large Language Models. arXiv 2024, arXiv:2410.02761.
- Wu, X.; Liu, C. DiTPainter: Efficient Video Inpainting with Diffusion Transformers. CoRR 2025.
- Aji, A.F.; Winata, G.I.; Koto, F.; Cahyawijaya, S.; Romadhony, A.; Mahendra, R.; Kurniawan, K.; Moeljadi, D.; Prasojo, R.E.; Baldwin, T.; et al. One Country, 700+ Languages: NLP Challenges for Underrepresented Languages and Dialects in Indonesia. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, 2022; Volume 1, pp. 7226–7249.
- Zhou, S.; Li, C.; Chan, K.C.K.; Loy, C.C. ProPainter: Improving Propagation and Transformer for Video Inpainting. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 10443–10452.
- Huang, S. Reinforcement Learning with Reward Shaping for Last-Mile Delivery Dispatch Efficiency. European Journal of Business, Economics & Management 2025, 1, 122–130.
- Huang, S. Prophet with Exogenous Variables for Procurement Demand Prediction under Market Volatility. Journal of Computer Technology and Applied Mathematics 2025, 2, 15–20.
- Liu, W. Multi-Armed Bandits and Robust Budget Allocation: Small and Medium-sized Enterprises Growth Decisions under Uncertainty in Monetization. European Journal of AI, Computing & Informatics 2025, 1, 89–97.
- Zhang, H.; Tao, M.; Shi, Y.; Bi, X. Federated Multi-task Learning with Non-stationary Heterogeneous Data. In Proceedings of the ICC 2022 - IEEE International Conference on Communications, 2022; pp. 4950–4955.
- Zhang, H.; Tao, M.; Shi, Y.; Bi, X.; Letaief, K.B. Federated Multi-task Learning with Non-stationary and Heterogeneous Data in Wireless Networks. IEEE Transactions on Wireless Communications 2023, 23, 2653–2667.
- Long, Q.; Wang, M.; Li, L. Generative Imagination Elevates Machine Translation. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2021; pp. 5738–5748.
- Long, Q.; Wu, Y.; Wang, W.; Pan, S.J. Does In-Context Learning Really Learn? Rethinking How Large Language Models Respond and Solve Tasks via In-Context Learning. arXiv 2024, arXiv:2404.07546.
- Long, Q.; Deng, Y.; Gan, L.; Wang, W.; Pan, S.J. Backdoor Attacks on Dense Retrieval via Public and Unintentional Triggers. In Proceedings of the Second Conference on Language Modeling, 2025.
- Zhou, B.; Richardson, K.; Ning, Q.; Khot, T.; Sabharwal, A.; Roth, D. Temporal Reasoning on Implicit Events from Distant Supervision. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2021; pp. 1361–1371.
- Seo, A.; Kang, G.C.; Park, J.; Zhang, B.T. Attend What You Need: Motion-Appearance Synergistic Networks for Video Question Answering. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, 2021; Volume 1, pp. 6167–6177.
- Zhang, W.; Li, X.; Deng, Y.; Bing, L.; Lam, W. Towards Generative Aspect-Based Sentiment Analysis. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, 2021; Volume 2, pp. 504–510.
- Tang, Z.; Lei, J.; Bansal, M. DeCEMBERT: Learning from Noisy Instructional Videos via Dense Captions and Entropy Minimization. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2021; pp. 2415–2426.
- Lei, J.; Berg, T.; Bansal, M. Revealing Single Frame Bias for Video-and-Language Learning. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, 2023; Volume 1, pp. 487–507.
- Maaz, M.; Rasheed, H.; Khan, S.; Khan, F. Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, 2024; Volume 1, pp. 12585–12602.
- Yang, J.; Yu, Y.; Niu, D.; Guo, W.; Xu, Y. ConFEDE: Contrastive Feature Decomposition for Multimodal Sentiment Analysis. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, 2023; Volume 1, pp. 7617–7630.
- Hoxha, A.; Shehu, B.; Kola, E.; Koklukaya, E. A Survey of Generative Video Models as Visual Reasoners. 2026.
- Lin, B.; Ye, Y.; Zhu, B.; Cui, J.; Ning, M.; Jin, P.; Yuan, L. Video-LLaVA: Learning United Visual Representation by Alignment Before Projection. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024; pp. 5971–5984.
- Qi, L.; Wu, J.; Choi, J.M.; Phillips, C.; Sengupta, R.; Goldman, D.B. Over++: Generative Video Compositing for Layer Interaction Effects. arXiv 2025, arXiv:2512.19661.
- Gong, B.; Qi, L.; Wu, J.; Fu, Z.; Song, C.; Jacobs, D.W.; Nicholson, J.; Sengupta, R. The Aging Multiverse: Generating Condition-Aware Facial Aging Tree via Training-Free Diffusion. arXiv 2025, arXiv:2506.21008.
- Qi, L.; Wu, J.; Gong, B.; Wang, A.N.; Jacobs, D.W.; Sengupta, R. MyTimeMachine: Personalized Facial Age Transformation. ACM Transactions on Graphics 2025, 44, 1–16.
- Tang, R.; Liu, L.; Pandey, A.; Jiang, Z.; Yang, G.; Kumar, K.; Stenetorp, P.; Lin, J.; Ture, F. What the DAAM: Interpreting Stable Diffusion Using Cross Attention. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, 2023; Volume 1, pp. 5644–5659.
- Chen, P.C.; Tsai, H.; Bhojanapalli, S.; Chung, H.W.; Chang, Y.W.; Ferng, C.S. A Simple and Effective Positional Encoding for Transformers. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021; pp. 2974–2988.
- Hendricks, L.A.; Nematzadeh, A. Probing Image-Language Transformers for Verb Understanding. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, 2021; pp. 3635–3644.
- Wen, H.; Lin, Y.; Lai, T.; Pan, X.; Li, S.; Lin, X.; Zhou, B.; Li, M.; Wang, H.; Zhang, H.; et al. RESIN: A Dockerized Schema-Guided Cross-document Cross-lingual Cross-media Information Extraction and Event Tracking System. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Demonstrations, 2021; pp. 133–143.
- Kamalloo, E.; Dziri, N.; Clarke, C.; Rafiei, D. Evaluating Open-Domain Question Answering in the Era of Large Language Models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, 2023; Volume 1, pp. 5591–5606.
- Silva, A.; Tambwekar, P.; Gombolay, M. Towards a Comprehensive Understanding and Accurate Evaluation of Societal Biases in Pre-Trained Transformers. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2021; pp. 2383–2389.
- Liu, Y.; Guan, R.; Giunchiglia, F.; Liang, Y.; Feng, X. Deep Attention Diffusion Graph Neural Networks for Text Classification. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021; pp. 8142–8152.
- Zhou, Z.; de Melo, M.L.; Rios, T.A. Toward Multimodal Agent Intelligence: Perception, Reasoning, Generation and Interaction. 2025.
- Qian, W.; Shang, Z.; Wen, D.; Fu, T. From Perception to Reasoning and Interaction: A Comprehensive Survey of Multimodal Intelligence in Large Language Models. Authorea Preprints 2025.


Comparison with state-of-the-art methods on short videos (Section 4.3).

| Method | PSNR (↑) | SSIM (↑) | VFID (↓) |
|---|---|---|---|
| ProPainter [9] | 34.46 | 0.9834 | 0.069 |
| DiTPainter (4 steps) [7] | 34.86 | 0.9844 | 0.056 |
| DiTPainter (8 steps) [7] | 34.60 | 0.9843 | 0.051 |
| TempCo-Painter (4 steps) | 34.92 | 0.9846 | 0.054 |
| TempCo-Painter (8 steps) | 34.75 | 0.9845 | 0.049 |

Ablation study of TempCo-Painter components (Section 4.4).

| Method | PSNR (↑) | SSIM (↑) | VFID (↓) |
|---|---|---|---|
| TempCo-Painter (Base) | 34.50 | 0.9839 | 0.058 |
| + Hierarchical Spatial-Temporal Attention | 34.62 | 0.9841 | 0.055 |
| + Motion-Guided Attention | 34.69 | 0.9843 | 0.052 |
| + Dynamic Mask Awareness | 34.72 | 0.9844 | 0.051 |
| TempCo-Painter (Full) | 34.75 | 0.9845 | 0.049 |

Long video inpainting performance (Section 4.6).

| Method | PSNR (↑) | SSIM (↑) | VFID (↓) |
|---|---|---|---|
| ProPainter [9] | 32.15 | 0.9781 | 0.088 |
| DiTPainter (8 steps) [7] | 32.89 | 0.9802 | 0.065 |
| TempCo-Painter (8 steps) | 33.61 | 0.9818 | 0.053 |

Inference efficiency comparison (Section 4.7).

| Method | Params (M) | GFLOPs (↓) | Inference Time (s/frame) (↓) | FPS (↑) |
|---|---|---|---|---|
| ProPainter [9] | 180 | 1200 | 0.35 | 2.86 |
| DiTPainter (4 steps) [7] | 250 | 1500 | 0.28 | 3.57 |
| DiTPainter (8 steps) [7] | 250 | 3000 | 0.56 | 1.79 |
| TempCo-Painter (4 steps) | 220 | 1350 | 0.25 | 4.00 |
| TempCo-Painter (8 steps) | 220 | 2700 | 0.50 | 2.00 |

Robustness across the four mask scenarios of Section 4.8 (SSM, LSM, SDM, LDM).

| Mask Scenario | Method | PSNR (↑) | SSIM (↑) | VFID (↓) |
|---|---|---|---|---|
| SSM | ProPainter [9] | 35.80 | 0.9870 | 0.045 |
| SSM | DiTPainter [7] | 36.15 | 0.9878 | 0.041 |
| SSM | TempCo-Painter | 36.28 | 0.9881 | 0.038 |
| LSM | ProPainter [9] | 33.95 | 0.9820 | 0.068 |
| LSM | DiTPainter [7] | 34.20 | 0.9825 | 0.060 |
| LSM | TempCo-Painter | 34.45 | 0.9830 | 0.055 |
| SDM | ProPainter [9] | 33.50 | 0.9810 | 0.075 |
| SDM | DiTPainter [7] | 33.85 | 0.9815 | 0.069 |
| SDM | TempCo-Painter | 34.10 | 0.9822 | 0.062 |
| LDM | ProPainter [9] | 31.90 | 0.9760 | 0.095 |
| LDM | DiTPainter [7] | 32.25 | 0.9775 | 0.081 |
| LDM | TempCo-Painter | 32.70 | 0.9789 | 0.072 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).