Submitted: 18 March 2025
Posted: 19 March 2025
Abstract
Keywords:
1. Introduction
2. Related Work
3. Dataset and Features
4. Methods
4.1. Video Generation with NVIDIA Cosmos
4.2. Quality Metrics
4.2.1. Peak Signal-to-Noise Ratio (PSNR)
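PSNR is conventionally defined as 10·log10(MAX² / MSE), where MSE is the mean squared error between a reference frame and a generated frame and MAX is the largest possible pixel value. The following is a minimal sketch of that definition, not the evaluation code used in this study. It assumes 8-bit pixels (MAX = 255) and frames flattened into equal-length sequences; the function names are illustrative.

```python
import math
from typing import Sequence


def mse(reference: Sequence[float], generated: Sequence[float]) -> float:
    """Mean squared error between two flattened frames of equal length."""
    if len(reference) != len(generated) or len(reference) == 0:
        raise ValueError("frames must be non-empty and of equal length")
    return sum((r - g) ** 2 for r, g in zip(reference, generated)) / len(reference)


def psnr(reference: Sequence[float], generated: Sequence[float],
         max_value: float = 255.0) -> float:
    """PSNR in dB; max_value=255 assumes 8-bit pixel data."""
    err = mse(reference, generated)
    if err == 0.0:
        return math.inf  # identical frames have unbounded PSNR
    return 10.0 * math.log10(max_value ** 2 / err)


def video_psnr(reference_frames, generated_frames, max_value: float = 255.0) -> float:
    """Average the per-frame PSNR over a clip (one common convention)."""
    scores = [psnr(r, g, max_value) for r, g in zip(reference_frames, generated_frames)]
    return sum(scores) / len(scores)
```

Averaging per-frame PSNR is only one way to aggregate over a clip; computing a single MSE over all frames first gives a slightly different number, so the choice should be reported alongside the result.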
4.2.2. Structural Similarity Index (SSIM)
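SSIM [8] combines luminance, contrast, and structure terms into a single score, with 1 reached only for identical images. The sketch below is a deliberate simplification: it computes one SSIM value from global statistics of a whole flattened frame, whereas the original formulation evaluates the index over local windows and averages the results. The stabilising constants K1 = 0.01 and K2 = 0.03 follow the original paper; the 8-bit range and the function name are assumptions.

```python
from typing import Sequence


def global_ssim(x: Sequence[float], y: Sequence[float], max_value: float = 255.0,
                k1: float = 0.01, k2: float = 0.03) -> float:
    """Single-window SSIM over two flattened frames (simplified, not windowed)."""
    n = len(x)
    if n != len(y) or n == 0:
        raise ValueError("frames must be non-empty and of equal length")
    c1 = (k1 * max_value) ** 2  # stabilises the luminance term
    c2 = (k2 * max_value) ** 2  # stabilises the contrast/structure term
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    # Population (divide-by-n) statistics are used here for brevity.
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov_xy = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator
```

Because the covariance term can be negative, this score can fall below zero for anti-correlated frames. A windowed implementation from an image-processing library is the better choice for reported results.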
4.2.3. Video Multi-method Assessment Fusion (VMAF)
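VMAF is a full-reference metric reported on a 0 to 100 scale, so every generated clip must be paired with a reference clip of the same resolution. The source does not state how VMAF was computed here. One common route is FFmpeg's `libvmaf` filter, which takes the distorted video as its first input and the reference as its second. The sketch below only assembles such a command; it assumes an FFmpeg build with libvmaf enabled, and the file names are hypothetical.

```python
from typing import List


def build_vmaf_command(distorted_path: str, reference_path: str,
                       log_path: str = "vmaf.json") -> List[str]:
    """Assemble an FFmpeg invocation that scores one clip against a reference."""
    # First input = distorted (generated) clip, second input = reference clip.
    filter_graph = f"[0:v][1:v]libvmaf=log_fmt=json:log_path={log_path}"
    return [
        "ffmpeg",
        "-i", distorted_path,
        "-i", reference_path,
        "-lavfi", filter_graph,
        "-f", "null", "-",  # discard decoded output; only the log is wanted
    ]


if __name__ == "__main__":
    # Hypothetical file names, shown for illustration only.
    print(" ".join(build_vmaf_command("generated_000.mp4", "reference.mp4")))
```

The command could then be executed with `subprocess.run(cmd, check=True)` and the per-frame scores read from the JSON log.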
5. Experiments, Results, and Discussion
5.1. Experimental Setup
- Prompts: One fixed positive and one fixed negative prompt.
- Generation Parameters:
  - Diffusion Steps: 20
  - CFG Scale: 7.5
  - Resolution: 1280×704
  - Video Length: 121 frames
  - Frame Rate: 12 fps
- Samplers: Euler ancestral vs. Res multistep
- Number of Videos: 50 per sampler
- Evaluation Metrics: PSNR, SSIM, VMAF
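The setup listed above can be captured in a small configuration object, which also makes the derived quantities explicit: 121 frames at 12 fps is roughly 10 seconds per clip, and 50 videos for each of two samplers gives 100 clips in total. This is a sketch for bookkeeping only. The class name, field names, and sampler identifiers are illustrative assumptions, not the interface of the Cosmos pipeline.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class GenerationConfig:
    """Fixed generation parameters from the experimental setup."""
    diffusion_steps: int = 20
    cfg_scale: float = 7.5
    width: int = 1280
    height: int = 704
    num_frames: int = 121
    fps: int = 12

    @property
    def duration_seconds(self) -> float:
        return self.num_frames / self.fps


# Hypothetical identifiers for the two samplers being compared.
SAMPLERS: Tuple[str, ...] = ("euler_ancestral", "res_multistep")
VIDEOS_PER_SAMPLER = 50


def experiment_grid(samplers: Tuple[str, ...] = SAMPLERS,
                    videos_per_sampler: int = VIDEOS_PER_SAMPLER) -> List[Tuple[str, int]]:
    """Enumerate every (sampler, video index) pair to be generated."""
    return [(s, i) for s in samplers for i in range(videos_per_sampler)]


if __name__ == "__main__":
    cfg = GenerationConfig()
    print(f"clip duration: {cfg.duration_seconds:.2f} s")
    print(f"total clips:   {len(experiment_grid())}")
```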
5.2. Euler Ancestral Sampler Results
- PSNR: Approximately 14.0 dB
- SSIM: Ranges from 0.04 to 0.07
- VMAF: Between 0.000 and 0.009



5.3. Res Multistep Sampler Results
- PSNR: Ranges from 14.1 to 14.6 dB
- SSIM: Ranges from 0.64 to 0.65
- VMAF: Between 0.0050 and 0.0054



5.4. Comparative Analysis
- Structural Similarity: Res multistep achieves SSIM of 0.64 to 0.65 versus 0.04 to 0.07 for Euler ancestral, roughly 9 to 16 times higher, indicating far better preservation of structure [8].
- Pixel-Level Fidelity: Both samplers show similar PSNR values (about 14.0 dB versus 14.1 to 14.6 dB), suggesting comparable pixel-wise reconstruction errors [7].
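- Perceptual Quality: Both samplers score near zero on VMAF, which is reported on a 0 to 100 scale. The Res multistep range (0.0050 to 0.0054) falls inside the Euler ancestral range (0.000 to 0.009), so Res multistep is more consistent across videos, but any gain in perceptual alignment is marginal at best.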
| Sampler | PSNR (dB) | SSIM | VMAF |
|---|---|---|---|
| Euler ancestral | ∼14.0 | 0.04 – 0.07 | 0.000 – 0.009 |
| Res multistep | 14.1 – 14.6 | 0.64 – 0.65 | 0.0050 – 0.0054 |
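A little arithmetic puts the table in perspective. The sketch below computes the SSIM ratio implied by the reported ranges and converts PSNR back into a root-mean-square pixel error. The conversion assumes 8-bit pixel data (peak value 255), which the source does not state explicitly.

```python
def rmse_from_psnr(psnr_db: float, max_value: float = 255.0) -> float:
    """Invert PSNR = 20 * log10(max_value / RMSE)."""
    return max_value / (10 ** (psnr_db / 20.0))


# Ranges copied from the results table.
EULER_SSIM = (0.04, 0.07)
RES_SSIM = (0.64, 0.65)

# Most conservative and most generous SSIM ratios between the two samplers.
ssim_ratio_low = RES_SSIM[0] / EULER_SSIM[1]   # 0.64 / 0.07, about 9.1
ssim_ratio_high = RES_SSIM[1] / EULER_SSIM[0]  # 0.65 / 0.04 = 16.25

if __name__ == "__main__":
    print(f"SSIM ratio: {ssim_ratio_low:.1f}x to {ssim_ratio_high:.2f}x")
    print(f"RMSE at 14.0 dB: {rmse_from_psnr(14.0):.1f}")  # about 50.9
    print(f"RMSE at 14.6 dB: {rmse_from_psnr(14.6):.1f}")  # about 47.5
```

Under the 8-bit assumption, a PSNR near 14 dB corresponds to an average error of about 50 grey levels per pixel for both samplers, so the large SSIM gap is not accompanied by a comparable gap in raw pixel error.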
- Multi-Step Refinement: Res multistep refines the denoising process in multiple sub-steps, leading to enhanced structural stability [6].
- Artifact Reduction: This iterative approach helps reduce high-frequency artifacts, which is consistent with the much higher SSIM.
- Temporal Consistency: Frame-to-frame coherence improves, although VMAF scores remain near zero and further work is needed to raise them.
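For contrast with the multi-step refinement described above, a single Euler ancestral update can be sketched as follows. Ancestral samplers take a deterministic Euler step to a reduced noise level and then re-inject fresh Gaussian noise. The noise split below follows the formulation commonly used in open-source samplers built on the framework of [6]. Whether the Cosmos pipeline uses exactly this variant is an assumption, and the Res multistep update itself is not reproduced here.

```python
import math
import random
from typing import List, Sequence, Tuple


def ancestral_sigmas(sigma_from: float, sigma_to: float,
                     eta: float = 1.0) -> Tuple[float, float]:
    """Split the target noise level into a deterministic and a stochastic part."""
    if sigma_to == 0.0:
        return 0.0, 0.0
    sigma_up = min(
        sigma_to,
        eta * math.sqrt(sigma_to ** 2 * (sigma_from ** 2 - sigma_to ** 2) / sigma_from ** 2),
    )
    sigma_down = math.sqrt(sigma_to ** 2 - sigma_up ** 2)
    return sigma_down, sigma_up


def euler_ancestral_step(x: Sequence[float], denoised: Sequence[float],
                         sigma_from: float, sigma_to: float,
                         rng: random.Random) -> List[float]:
    """One update: Euler step down to sigma_down, then add noise of scale sigma_up."""
    sigma_down, sigma_up = ancestral_sigmas(sigma_from, sigma_to)
    out = []
    for x_i, d_i in zip(x, denoised):
        slope = (x_i - d_i) / sigma_from              # derivative estimate
        x_next = x_i + slope * (sigma_down - sigma_from)  # deterministic part
        x_next += rng.gauss(0.0, 1.0) * sigma_up          # stochastic part
        out.append(x_next)
    return out
```

Here `denoised` stands in for the model's prediction of the clean sample at the current noise level; in a real pipeline it would come from the diffusion network rather than being passed in directly.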
6. Conclusion and Future Work
6.1. Final Summary
6.2. Future Work
- Parameter Exploration: Varying the number of diffusion steps and CFG scale to examine broader performance trade-offs.
- Additional Metrics: Incorporating video FID and temporal flow consistency to better capture perceptual quality.
- Reference Alignment: Employing more suitable or domain-specific reference videos to yield more meaningful VMAF evaluations.
- Real-Time Enhancement: Investigating algorithmic and hardware optimizations to achieve near-real-time text-to-video generation.
References
- Li, F. Efficient Adaptive Parameter Tuning for Real-Time Text-to-Video Generation using NVIDIA Cosmos Diffusion Models. In Proceedings of the Milestone Report; 2023; pp. 93–95.
- Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2022; pp. 10684–10695.
- Sohl-Dickstein, J.; Weiss, E.; Maheswaranathan, N.; Ganguli, S. Deep unsupervised learning using nonequilibrium thermodynamics. In Proceedings of the International Conference on Machine Learning. PMLR; 2015; pp. 2256–2265.
- Ho, J.; Jain, A.; Abbeel, P. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems 2020, 33, 6840–6851.
- Nichol, A.Q.; Dhariwal, P. Improved denoising diffusion probabilistic models. In Proceedings of the International Conference on Machine Learning. PMLR; 2021; pp. 8162–8171.
- Karras, T.; Aittala, M.; Aila, T.; Laine, S. Elucidating the design space of diffusion-based generative models. Advances in Neural Information Processing Systems 2022, 35, 26565–26577.
- Gonzalez, R.C. Digital Image Processing; Pearson Education India, 2009.
- Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing 2004, 13, 600–612.
- Netflix. VMAF: Video Multi-Method Assessment Fusion. https://github.com/Netflix/vmaf, 2016. Accessed: 2023-10-01.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).