Submitted: 04 February 2026
Posted: 06 February 2026
Abstract
Keywords:
1. Introduction

- We propose MCT-Video, an end-to-end optimized causal latent video diffusion model built on a lightweight transformer backbone, designed for ultra-low-latency, memory-efficient T2V generation on edge devices.
- We introduce an optimization pipeline combining Adaptive Sparse Temporal Attention (ASTA), Quantization-Aware Fine-tuning (QAF) for W8A8 precision, and a Unified Multi-objective Distillation framework, preserving generation quality while improving efficiency end to end.
- We demonstrate state-of-the-art edge T2V performance on Qualcomm Hexagon NPUs, achieving higher video quality at significantly lower inference latency and memory consumption than existing optimized methods.
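The QAF component targets W8A8 precision, i.e. both weights and activations quantized to 8 bits. As a hedged illustration of the underlying idea, the sketch below simulates symmetric per-tensor fake quantization in NumPy; the function names and the per-tensor scale choice are our assumptions, not the paper's implementation.

```python
import numpy as np

def fake_quantize(x: np.ndarray, num_bits: int = 8) -> np.ndarray:
    """Symmetric per-tensor fake quantization: round to the signed
    int range, then dequantize, so downstream math sees the error."""
    qmax = 2 ** (num_bits - 1) - 1                    # 127 for 8-bit signed
    scale = max(np.max(np.abs(x)) / qmax, 1e-8)       # avoid divide-by-zero
    q = np.clip(np.round(x / scale), -qmax, qmax)
    return q * scale

def w8a8_linear(x: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Simulated W8A8 matmul: both activations and weights pass
    through the quantize-dequantize round trip."""
    return fake_quantize(x) @ fake_quantize(w)

# Compare the quantized path against full precision on random data.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 16)).astype(np.float32)
w = rng.normal(size=(16, 8)).astype(np.float32)
y_fp, y_q = x @ w, w8a8_linear(x, w)
err = float(np.abs(y_fp - y_q).max())
```

Fine-tuning with this round trip in the forward pass (quantization-aware training) lets the model adapt its weights to the rounding error rather than absorbing it post hoc.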
2. Related Work
2.1. Text-to-Video Generation
2.2. Efficient AI and Model Compression for Edge Devices
3. Method
3.1. Overall Architecture of MiniCausal-T2V

3.2. Lightweight Causal Transformer Backbone (LCTB)
3.3. Adaptive Sparse Temporal Attention (ASTA)
3.4. Quantization-Aware Fine-tuning (QAF)
3.5. Unified Multi-objective Distillation Strategy
1. $\mathcal{L}_{\text{recon}}$ is the reconstruction loss for the VAE, minimizing the difference between the original video frame $x$ and its VAE reconstruction $\hat{x}$, which ensures the VAE maintains high fidelity in encoding and decoding: $\mathcal{L}_{\text{recon}} = \|x - \hat{x}\|_2^2$.
2. $\mathcal{L}_{\text{feat}}$ is a feature matching loss, ensuring that intermediate feature representations of the student models (LCTB, VAE, DistilT5) align with those of their respective teacher models. For a given feature layer $l$ with student features $f_l^{S}$ and teacher features $f_l^{T}$, this is $\mathcal{L}_{\text{feat}} = \sum_l \|f_l^{S} - f_l^{T}\|_2^2$. This loss helps transfer rich semantic and perceptual information from the teacher to the student.
3. $\mathcal{L}_{\text{FM}}$ is the flow-matching loss for the LCTB denoiser, as defined in Section 3.2. This term guides the core video generation capability.
4. $\mathcal{L}_{\text{first}}$ is an auxiliary loss term for a lightweight first-frame generator. This dedicated component ensures high-quality initial frames, which are critical for establishing visual consistency. It is trained jointly with the LCTB distillation, promoting temporal coherence between the generated first frame and the subsequent frames produced by the LCTB. This loss typically takes the form of a reconstruction objective on the first frame.
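The four objectives above combine into a single weighted training loss. The NumPy sketch below illustrates one way that combination could look; the `lambdas` weights and the use of plain MSE for every term are our illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def mse(a: np.ndarray, b: np.ndarray) -> float:
    """Mean squared error between two arrays of the same shape."""
    return float(np.mean((a - b) ** 2))

def unified_distillation_loss(frame, vae_recon,
                              student_feats, teacher_feats,
                              v_pred, v_target,
                              first_pred, first_target,
                              lambdas=(1.0, 0.5, 1.0, 0.5)):
    """Weighted sum of the four distillation objectives.
    The lambda weights are illustrative, not the paper's values."""
    l_recon = mse(frame, vae_recon)          # 1. VAE reconstruction
    l_feat = sum(mse(s, t) for s, t in       # 2. teacher-student feature matching
                 zip(student_feats, teacher_feats))
    l_fm = mse(v_pred, v_target)             # 3. flow matching (denoiser target)
    l_first = mse(first_pred, first_target)  # 4. first-frame reconstruction
    return sum(w * t for w, t in
               zip(lambdas, (l_recon, l_feat, l_fm, l_first)))

# Toy usage with random stand-ins for frames, features, and targets.
rng = np.random.default_rng(0)
frame = rng.normal(size=(3, 32, 32))
loss = unified_distillation_loss(
    frame, frame + 0.1 * rng.normal(size=frame.shape),
    [rng.normal(size=(8, 8))], [rng.normal(size=(8, 8))],
    rng.normal(size=(16,)), rng.normal(size=(16,)),
    frame, frame)
```

Because every term is non-negative, the total is zero only when each student output exactly matches its target, which makes the combined objective easy to sanity-check during training.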
3.6. Extreme Step Flow-Matching Inference
4. Experiments
4.1. Experimental Setup
4.1.1. Task Definition
4.1.2. Datasets
4.1.3. Training Details

4.1.4. Deployment and Evaluation Hardware
4.2. Baseline Methods
4.3. Quantitative Results

4.4. Ablation Study
| Model | Realism | Temporal Coherence | Text Alignment | Overall Quality | Preference Rate (%) |
| --- | --- | --- | --- | --- | --- |
| Mobile Hummingbird 26-frame | 3.85 | 3.70 | 3.90 | 3.80 | 15.2 |
| SnapGenV | 4.05 | 3.95 | 4.10 | 4.05 | 22.8 |
| Neodragon E2E | 4.15 | 4.10 | 4.20 | 4.15 | 25.5 |
| MCT-Video E2E | 4.35 | 4.30 | 4.35 | 4.40 | 36.5 |
4.5. Human Evaluation
4.6. Efficiency Breakthrough: A Deeper Dive
4.7. The Role of Causal Design and Adaptive Attention
4.8. Synergy of Quantization and Multi-objective Distillation
4.9. Qualitative Analysis and Exemplar Generations
1. High Text Alignment: The model accurately interprets diverse text prompts, translating intricate descriptions into corresponding visual elements and actions. For instance, a prompt like "A golden retriever puppy frolicking in a field of sunflowers under a clear blue sky" generates a video featuring a puppy with appropriate motion and interactions within the specified environment, matching the semantic content closely.
2. Realistic Motion and Temporal Coherence: Consistent with its high VBench temporal consistency and human evaluation scores, MCT-Video generates fluid and believable motion. Movements are smooth, and objects interact realistically with their environment. For example, a video generated from "A majestic eagle soaring gracefully over a snow-capped mountain range" demonstrates continuous, sweeping flight paths and appropriate camera movements, avoiding jitter or abrupt scene changes.
3. Flicker Reduction: The causal attention design and robust training minimize the flickering artifacts commonly seen in efficient video generation models, yielding a stable visual experience and higher perceptual quality.
4. Sharpness and Detail: Despite operating at W8A8 precision and undergoing significant compression, the frames reconstructed by the VAE (trained with the reconstruction and feature-matching distillation objectives) maintain a high degree of sharpness and detail. A prompt such as "A vintage car driving down a cobblestone street in Paris, rain falling lightly" renders intricate details of the car's chrome, wet cobblestones, and the soft blur of rain, contributing to a realistic aesthetic.
5. Effective Scene Understanding: The model effectively composes complex scenes, as indicated by its high VBench Scene score (57.20). Prompts involving multiple objects, backgrounds, and interactions, like "A group of children building a sandcastle on a sunny beach, waves gently lapping at the shore", correctly place all elements in a harmonious and dynamic scene.
5. Conclusions
References
| Optimization Component | Est. Latency Reduction (%) | Cumulative Latency Reduction (%) |
| --- | --- | --- |
| Lightweight Causal Transformer Backbone (LCTB) | 30 | 30 |
| Adaptive Sparse Temporal Attention (ASTA) | 15 | 45 |
| Extreme Step Flow-Matching Inference | 40 | 85 |
| Quantization-Aware Fine-tuning (QAF) | 10 | 95 |
| Unified Multi-objective Distillation | 5 | 100 |
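The cumulative column is the running sum of the per-component contributions to the total latency reduction; a minimal sanity check:

```python
from itertools import accumulate

# Per-component latency-reduction contributions (%) from the table above.
contributions = {
    "LCTB": 30,
    "ASTA": 15,
    "Extreme Step Flow-Matching": 40,
    "QAF": 10,
    "Distillation": 5,
}

# Running sum reproduces the cumulative column.
cumulative = list(accumulate(contributions.values()))
# cumulative == [30, 45, 85, 95, 100]
```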
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).