Submitted: 31 December 2025
Posted: 31 December 2025
Abstract
Keywords:
1. Introduction
2. Related Work
3. Method: Multimodal Supervisory Graphs
4. Experiments
5. Conclusions
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).