Submitted: 02 February 2026
Posted: 04 February 2026
Abstract
Keywords:
1. Introduction

- We propose StreamEdit-DiT, a DiT-based architecture tailored for real-time streaming text-to-video editing, featuring a Progressive Temporal Consistency Module (PTCM) and Dynamic Sparse Attention (DSA) for improved temporal coherence and computational efficiency (an illustrative sparse-attention sketch follows this list).
- We introduce a training framework that couples Streaming Coherence Matching (SCM) with an Adaptive Sliding Window (ASW) buffer mechanism, complemented by a Hierarchical Progressive Distillation strategy, enabling efficient, high-fidelity real-time streaming generation in just 6 sampling steps.
- We demonstrate that StreamEdit-DiT achieves state-of-the-art performance on a challenging V2V editing benchmark, delivering real-time streaming video editing at 512p resolution and 18 FPS on a single H100 GPU.
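The DSA mechanism itself is specified in Section 3.1.2, which is not reproduced here. As a generic illustration of the dynamic-sparse-attention idea (each query attends only to its top-k highest-scoring keys), the following PyTorch sketch may help; the function name, shapes, and top-k selection policy are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def dynamic_sparse_attention(q, k, v, top_k=32):
    """Generic top-k sparse attention sketch (illustrative only).

    Each query attends to just its top_k highest-scoring keys, the core
    idea behind sparse attention over long spatio-temporal sequences.
    Shapes: q, k, v are (batch, heads, tokens, head_dim).
    """
    scale = q.shape[-1] ** -0.5
    scores = torch.matmul(q, k.transpose(-2, -1)) * scale   # (B, H, Tq, Tk)

    # Keep the top_k scores per query and mask out everything else.
    top_k = min(top_k, scores.shape[-1])
    kth_score = scores.topk(top_k, dim=-1).values[..., -1, None]
    scores = scores.masked_fill(scores < kth_score, float("-inf"))

    attn = F.softmax(scores, dim=-1)
    return torch.matmul(attn, v)

# Toy usage: 2 heads over 256 spatio-temporal tokens of width 64.
q = k = v = torch.randn(1, 2, 256, 64)
print(dynamic_sparse_attention(q, k, v).shape)  # torch.Size([1, 2, 256, 64])
```

Note that this naive sketch still computes the full score matrix before masking; practical speedups require skipping the masked work, e.g. with block-sparse attention kernels.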
2. Related Work
2.1. Text-to-Video Generation and Diffusion Transformers
2.2. Real-Time Streaming Video Editing and Efficiency Optimization
3. Method
3.1. Model Architecture

3.1.1. Progressive Temporal Consistency Module (PTCM)
3.1.2. Dynamic Sparse Attention (DSA)
3.2. Training Framework and Objectives
3.2.1. Streaming Coherence Matching (SCM)
3.2.2. Adaptive Sliding Window (ASW)
3.3. Distillation for Real-Time Inference
3.4. Component Reusability and Preprocessing
3.4.1. Spatio-Temporal Latent Encoder (STLE)
3.4.2. Text Encoders
3.5. Model Variants
4. Experiments
4.1. Experimental Setup
4.1.1. Training Workflow
The training workflow proceeds in three stages (a hedged configuration sketch follows this list):

1. Base Task Learning: A foundational T2V editing model is first pre-trained on 5K high-quality video-text instruction pairs. This stage equips the model with a robust understanding of text-guided video content editing and uses a moderate learning rate of 5e-5.
2. Streaming Adaptation: The model is then adapted to streaming inputs on a large-scale dataset of 3M general video-text pairs, using a smaller learning rate of 1e-5 together with the proposed Streaming Coherence Matching (SCM) objective and Adaptive Sliding Window (ASW) mechanism. This phase is crucial for generalization and consistency over extended video streams.
3. Quality Fine-tuning & Interactivity Optimization: Finally, the model is fine-tuned on curated high-quality datasets to refine editing nuances and improve responsiveness to dynamic real-time prompt changes, ensuring a highly interactive editing experience.
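To make the schedule concrete, here is a minimal sketch of how the three stages could be declared. The stage and dataset identifiers, the flag names, and the stage-3 learning rate (unspecified in the text) are hypothetical placeholders, not the authors' configuration.

```python
# Hypothetical declaration of the three-stage schedule above. The stage
# and dataset identifiers, flag names, and the stage-3 learning rate
# (unspecified in the text) are placeholders, not the authors' config.
STAGES = [
    {   # Stage 1: base task learning on 5K instruction pairs.
        "name": "base_task_learning",
        "dataset": "v2v_instruction_pairs_5k",
        "lr": 5e-5,
        "use_scm": False,  # SCM/ASW are introduced in stage 2.
        "use_asw": False,
    },
    {   # Stage 2: streaming adaptation on 3M video-text pairs.
        "name": "streaming_adaptation",
        "dataset": "general_video_text_3m",
        "lr": 1e-5,
        "use_scm": True,   # Streaming Coherence Matching objective.
        "use_asw": True,   # Adaptive Sliding Window buffer.
    },
    {   # Stage 3: quality fine-tuning and interactivity optimization.
        "name": "quality_finetune",
        "dataset": "curated_high_quality_edits",
        "lr": 1e-5,        # Assumed; the text does not specify this LR.
        "use_scm": True,
        "use_asw": True,
    },
]

def run_training(model, train_stage):
    """Run the staged curriculum; `train_stage` is a user-supplied
    training loop taking (model, name=..., dataset=..., lr=..., ...)."""
    for stage in STAGES:
        train_stage(model, **stage)
```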
4.1.2. Distillation for Real-Time Inference
4.1.3. Data Processing and Representation
4.2. Quantitative Evaluation
4.2.1. Comparison with Baseline Methods
4.3. Ablation Study
4.4. Human Evaluation
4.5. Inference Speed and Efficiency Analysis
4.6. Scalability and Model Variant Performance
4.7. Impact of Adaptive Sliding Window (ASW) Configuration
5. Conclusion
Table: Comparison with baseline methods on the V2V editing benchmark (higher is better for all metrics).

| Method | PA ↑ | IP ↑ | TC ↑ | MS ↑ | EF ↑ | OQ ↑ |
| --- | --- | --- | --- | --- | --- | --- |
| StreamEditNet | 0.9452 | 0.9580 | 0.9785 | 0.9921 | 0.7850 | 0.8055 |
| ConsistentV2V | 0.9510 | 0.9692 | 0.9810 | 0.9880 | 0.7925 | 0.8120 |
| Ours (2B) | 0.9635 | 0.9650 | 0.9805 | 0.9902 | 0.8100 | 0.8280 |
| Ours-distill (2B) | 0.9601 | 0.9620 | 0.9790 | 0.9895 | 0.8055 | 0.8250 |
Table: Ablation study of StreamEdit-DiT-2B components.

| Method | PA ↑ | IP ↑ | TC ↑ | MS ↑ | EF ↑ | OQ ↑ |
| --- | --- | --- | --- | --- | --- | --- |
| StreamEdit-DiT-2B w/o PTCM | 0.9580 | 0.9505 | 0.9620 | 0.9850 | 0.7980 | 0.8100 |
| StreamEdit-DiT-2B w/o DSA | 0.9610 | 0.9595 | 0.9750 | 0.9880 | 0.8050 | 0.8200 |
| StreamEdit-DiT-2B w/o SCM+ASW | 0.9590 | 0.9520 | 0.9685 | 0.9865 | 0.8000 | 0.8150 |
| StreamEdit-DiT-2B (Full) | 0.9635 | 0.9650 | 0.9805 | 0.9902 | 0.8100 | 0.8280 |
Table: Human evaluation scores (higher is better).

| Method | Perceptual Quality ↑ | Temporal Consistency ↑ | Prompt Adherence ↑ |
| --- | --- | --- | --- |
| StreamEditNet | 3.5 | 3.6 | 3.4 |
| ConsistentV2V | 3.6 | 3.8 | 3.5 |
| Ours (2B) | 4.2 | 4.1 | 4.0 |
| Ours-distill (2B) | 4.0 | 3.9 | 3.8 |
Table: Inference speed and efficiency at 512p resolution.

| Method | Res. | FPS ↑ | Latency (ms) ↓ | GPU Mem. (GB) ↓ |
| --- | --- | --- | --- | --- |
| StreamEditNet | 512p | 15 | 65 | 16 |
| ConsistentV2V | 512p | 8 | 120 | 20 |
| Ours (2B) | 512p | 10 | 100 | 22 |
| Ours-distill (2B) | 512p | 18 | 55 | 18 |
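As a sanity check, the FPS and latency columns agree (FPS ≈ 1000 / per-frame latency: 55 ms → ≈18 FPS, 100 ms → 10 FPS). Per-frame latency of this kind is commonly measured with CUDA events; the sketch below is a generic harness under that assumption. It requires a CUDA device, and `model` and `frame` are placeholders for one pipeline step, not the released benchmarking code.

```python
import torch

@torch.no_grad()
def measure_frame_latency(model, frame, warmup=10, iters=100):
    """Generic per-frame latency harness using CUDA events.

    Returns (mean latency in ms, frames per second). `model` and
    `frame` are placeholders for one step of an editing pipeline.
    """
    for _ in range(warmup):            # warm up kernels and caches
        model(frame)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        model(frame)
    end.record()
    torch.cuda.synchronize()           # wait for all queued work
    latency_ms = start.elapsed_time(end) / iters
    return latency_ms, 1000.0 / latency_ms
```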
Table: Scalability across model variants (FPS measured on a single H100 GPU).

| Model | Params (B) | PA ↑ | TC ↑ | EF ↑ | OQ ↑ | FPS (H100) ↑ |
| --- | --- | --- | --- | --- | --- | --- |
| Ours (2B) | 2 | 0.9635 | 0.9805 | 0.8100 | 0.8280 | 10 |
| Ours-distill (2B) | 2 | 0.9601 | 0.9790 | 0.8055 | 0.8250 | 18 |
| Ours (10B) | 10 | 0.9700 | 0.9850 | 0.8250 | 0.8400 | 3 |
Table: Impact of Adaptive Sliding Window (ASW) configuration, comparing fixed window lengths against the proposed adaptive policy.

| ASW Max Len. (frames) | TC ↑ | IP ↑ | Mem. (GB) ↓ | Latency (ms) ↓ |
| --- | --- | --- | --- | --- |
| 4 (fixed window) | 0.9700 | 0.9550 | 16 | 60 |
| 8 (fixed window) | 0.9780 | 0.9600 | 19 | 75 |
| 16 (fixed window) | 0.9800 | 0.9620 | 24 | 90 |
| Adaptive (ours) | 0.9805 | 0.9650 | 22 | 70 |
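Per the table, the adaptive policy matches the 16-frame fixed window's temporal consistency (0.9805 vs. 0.9800) at lower latency (70 ms vs. 90 ms). The paper's ASW rule is defined in Section 3.2.2; as a hedged illustration of the general idea, the sketch below grows the context while consecutive frame latents stay similar and shrinks it on abrupt changes. The class name, the cosine-distance heuristic, and all thresholds are assumptions, not the paper's mechanism.

```python
import torch

class AdaptiveSlidingWindow:
    """Illustrative adaptive sliding-window buffer (assumed heuristic,
    not the paper's ASW rule): keep a long context while content is
    stable, and shrink it when the stream changes abruptly."""

    def __init__(self, min_len=4, max_len=16, change_threshold=0.5):
        self.min_len = min_len
        self.max_len = max_len
        self.change_threshold = change_threshold
        self.buffer = []

    def push(self, latent: torch.Tensor) -> list:
        if self.buffer:
            # Cosine distance between the new latent and the last one.
            change = 1 - torch.cosine_similarity(
                latent.flatten(), self.buffer[-1].flatten(), dim=0
            ).item()
            if change > self.change_threshold:
                # Abrupt change: keep only the most recent context.
                self.buffer = self.buffer[-(self.min_len - 1):]
        self.buffer.append(latent)
        # Never exceed the maximum window length.
        self.buffer = self.buffer[-self.max_len:]
        return self.buffer

# Usage: feed per-frame latents; the returned window conditions the model.
asw = AdaptiveSlidingWindow()
for _ in range(20):
    window = asw.push(torch.randn(64))
print(len(window))  # at most 16 frames of context
```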