Submitted: 10 June 2026
Posted: 11 June 2026
Abstract
Keywords:
1. Introduction
- We define consistency in diffusion-based generation as an overlap-aware family of agreement requirements, rather than a task-specific label, visual-quality proxy, or single metric.
- We organize the literature into external, internal, and normative consistency, compare methods through four auxiliary axes, and use secondary-relation annotations to mark overlap.
- We connect major consistency targets with mechanisms, evaluation protocols, benchmarks, failure modes, and known metric blind spots.
- We identify trade-offs when multiple consistency requirements must be satisfied jointly, including conflicts among faithfulness, identity preservation, safety, preference, temporal stability, and physical plausibility.
2. Background and Problem Formulation
2.1. Consistency as a Generative Requirement
2.2. Three Primary Consistency Relations
2.3. How Consistency Can Be Evaluated
2.4. Where Consistency Can Be Optimized
3. External Consistency
3.1. Prompt and Compositional Consistency
3.2. Grounded and Structural Consistency
3.3. Edit Consistency
4. Internal Consistency
4.1. Subject and Identity Consistency
4.2. Multi-View and 3D Consistency
4.3. Temporal and Narrative Consistency
5. Normative Consistency
5.1. Preference and Aesthetic Consistency
5.2. Safety and Value Consistency
5.3. Physical, Commonsense, and Causal Consistency
| Resource | Code | Type | Mod. | Primary | P/C | S/E | ID | V/T | N/S | P/W | Diagnostic use and blind spot |
|---|---|---|---|---|---|---|---|---|---|---|---|
| External Consistency Resources | | | | | | | | | | | |
| TIFA [50] | ✓ | Eval. | T2I | Ext. | H | M | L | L | L | L | QA-based prompt faithfulness; weak on identity, temporal, and world consistency. |
| GenEval [41] | ✓ | Bench. | T2I | Ext. | H | M | L | L | L | L | Object-focused prompt benchmark; covers omission, binding, and counting, not editing or temporal consistency. |
| T2I-CompBench [42] | ✓ | Bench. | T2I | Ext. | H | M | L | L | L | L | Broad compositional benchmark for relations and attributes; weaker on grounded control and preservation-based editing. |
| GenEval 2 [109] | ✓ | Bench. | T2I | Ext. | H | M | L | L | L | L | Harder prompt-following resource addressing benchmark drift; still centered on image-level prompt faithfulness. |
| HRS-Bench [178] | ✓ | Bench. | T2I | Ext. | H | M | L | L | M | L | Holistic T2I benchmark covering accuracy, robustness, generalization, fairness, and bias; useful for broad prompt-skill diagnostics but less specific to edit preservation. |
| DPG-Bench [179] | ✓ | Bench. | T2I | Ext. | H | M | L | L | L | L | Dense-prompt benchmark for long and complex prompt following; strong for entity, attribute, and relation grounding, but not designed for editing or temporal consistency. |
| GenAI-Bench [180] | ✓ | Bench./Eval. | T2I/T2V | Ext. | H | M | L | M | L | L | Compositional text-to-visual benchmark with VQA-style scoring; useful for external semantic agreement, but limited for identity, safety, and physical-state persistence. |
| EditBench [181] | – | Bench. | Edit | Ext. | M | H | L | L | L | L | Text-guided image inpainting benchmark; directly probes instruction adherence and preservation in masked editing, but less general for open-ended editing. |
| MagicBrush [182] | ✓ | Dataset/Bench. | Edit | Ext. | M | H | L | L | L | L | Manually annotated instruction-guided editing dataset with single-turn, multi-turn, mask-provided, and mask-free settings; strong for edit consistency but not for world or temporal plausibility. |
| ConceptBed [130] | ✓ | Bench. | T2I | Ext./Int. | M | M | M | L | L | L | Concept-learning and binding benchmark; useful bridge between prompt-level consistency and reusable subject concepts. |
| Internal Consistency Resources | | | | | | | | | | | |
| MVG-Bench [84] | ✓ | Bench. | T2I/3D | Int. | L | M | M | H | L | L | Dedicated benchmark for multi-view generation; emphasizes cross-view compatibility more than single-image realism. |
| MET3R [51] | ✓ | Eval. | T2I/3D | Int. | L | L | M | H | L | L | Multi-view consistency metric; strong for geometry-aware agreement, weaker for narrative or preference alignment. |
| VBench [183] | ✓ | Bench./Eval. | T2V | Int./Norm. | M | L | M | H | L | M | Comprehensive video-generation benchmark with dimensions such as subject consistency, background consistency, motion smoothness, and temporal flickering; broader than internal consistency but not designed for safety or preference alignment. |
| Video-Bench [184] | ✓ | Bench. | T2V/I2V | Int./Norm. | M | L | M | H | M | M | Human-aligned video-generation benchmark; useful for action consistency, temporal consistency, and motion quality, but not designed specifically for identity-preserving story generation. |
| EvalCrafter [185] | ✓ | Eval. | T2V | Int./Ext. | M | L | L | H | L | L | Unified evaluation toolkit for generated videos using visual, content, motion, and text-video alignment metrics; strong for video-level diagnostics but weaker for persistent identity, preference alignment, and causal state. |
| FETV [186] | ✓ | Bench. | T2V | Ext./Int. | H | L | L | M | L | L | Fine-grained open-domain T2V benchmark; useful for prompt complexity, attribute control, and temporal generation quality, but less focused on long-horizon identity or narrative memory. |
| ViStoryBench [187] | ✓ | Bench. | Story/T2I | Int. | M | L | H | H | L | L | Story-visualization benchmark focusing on character consistency, narrative coherence, and stylistic integrity across image sequences; strong for long-range internal consistency but not for physical dynamics. |
| MeViS [161] | ✓ | Dataset | Video | Int. | L | L | L | M | L | L | Motion-expression segmentation; useful for temporal grounding diagnostics, but not a generative benchmark. |
| MOSE [164] | ✓ | Dataset | Video | Int. | L | L | L | M | L | L | Video object segmentation resource for difficult scenes; useful for analyzing persistence under occlusion and scene change, but less direct than MeViS for motion-expression grounding. |
| TAO [188] | ✓ | Dataset | Video | Int. | L | L | M | M | L | L | Large-scale tracking benchmark; useful for long-range identity diagnostics, but not diffusion-specific. |
| VSPW [189] | ✓ | Dataset | Video | Int. | L | L | L | M | L | L | Video scene parsing resource; useful for scene-state continuity diagnostics rather than prompt following. |
| nuScenes [190] | ✓ | Dataset | Video/3D | Int. | L | M | L | M | L | M | Multimodal driving dataset; useful for geometry and dynamics diagnostics, but not a standalone generative benchmark. |
| Normative Consistency Resources | | | | | | | | | | | |
| Pick-a-Pic [93] | ✓ | Dataset | T2I | Norm. | M | L | L | L | H | L | Pairwise preference data; strong for preference learning, indirect for compositional faithfulness. |
| ImageReward [53] | ✓ | Eval. | T2I | Norm. | M | L | L | L | H | L | Learned preference reward; useful for ranking and optimization, but not for task-specific faithfulness. |
| HPS [90] | – | Eval. | T2I | Norm. | M | L | L | L | H | L | Human preference score; broad but coarse-grained. |
| HPSv2 [165] | ✓ | Bench. | T2I | Norm. | M | L | L | L | H | L | Refined human-preference benchmark with stronger evaluation coverage than HPS. |
| HPSv3 [166] | ✓ | Bench. | T2I | Norm. | M | L | L | L | H | L | Wide-spectrum human preference benchmark; still not designed for safety or world-consistency claims. |
| VisionReward [92] | ✓ | Eval. | T2I/T2V | Norm. | M | L | L | M | H | L | Multi-dimensional image/video preference evaluator; broader than image-only rewards, but still preference-centered. |
| Six-CD [101] | ✓ | Bench. | T2I | Norm. | L | M | L | L | H | L | Safety benchmark that jointly measures concept suppression and benign retention. |
| PhyBench [102] | ✓ | Bench. | T2I | Norm. | L | L | L | L | L | H | Physical commonsense benchmark for text-to-image generation; useful for static world plausibility more than temporal dynamics. |
| VideoPhy [39] | ✓ | Bench. | T2V | Norm. | L | L | L | M | L | H | Physical commonsense evaluation for generated videos. |
| PhyCoBench [191] | ✓ | Bench. | T2V | Norm. | L | L | L | M | L | H | Optical-flow-guided physical coherence benchmark; focused on motion plausibility. |
| PhyGenBench [54] | ✓ | Bench. | T2V | Norm. | L | L | L | M | L | H | Physical commonsense benchmark for video generation; emphasizes world simulation quality. |
| VideoPhy-2 [103] | ✓ | Bench. | T2V | Norm. | L | L | L | M | L | H | Action-centric physical commonsense benchmark; strong for action consequences and physical interaction. |
| T2VPhysBench [170] | – | Bench. | T2V | Norm. | L | L | L | M | L | H | First-principles physical consistency benchmark for text-to-video. |
| T2VWorldBench [104] | – | Bench. | T2V | Norm. | L | L | L | M | L | H | World-knowledge benchmark covering commonsense and causal plausibility beyond local motion realism. |
| Physics-IQ [106] | ✓ | Bench. | T2V | Norm. | L | L | L | M | L | H | Probes whether video generators internalize physical principles; diagnostic rather than end-to-end preference metric. |
| PhyWorldBench [105] | ✓ | Bench. | T2V | Norm. | L | L | L | M | L | H | Comprehensive physical realism benchmark for text-to-video. |
| VideoVerse [171] | ✓ | Bench. | T2V | Norm. | L | L | L | M | L | H | World-model-oriented T2V evaluation. |
| PhyEduVideo [192] | ✓ | Dataset/Bench. | T2V | Norm. | L | L | L | M | L | H | Physics-education-oriented benchmark exposing explanatory and world-consistency gaps. |
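One way to read the coverage columns of the table is as a lookup for assembling a complementary evaluation suite. The sketch below is illustrative only and is not a procedure proposed in this survey. It copies the H/M/L ratings of five rows verbatim, assumes that "H" marks an axis a resource probes strongly, and uses hypothetical helper names (`strong_axes`, `uncovered_axes`, `greedy_cover`) to pick resources greedily until every axis has at least one "H" entry.

```python
# Axis codes exactly as they appear in the table header.
AXES = ("P/C", "S/E", "ID", "V/T", "N/S", "P/W")

# Ratings copied from five table rows (one letter per axis, in AXES order).
RESOURCES = {
    "TIFA":         dict(zip(AXES, "HMLLLL")),
    "MagicBrush":   dict(zip(AXES, "MHLLLL")),
    "ViStoryBench": dict(zip(AXES, "MLHHLL")),
    "Pick-a-Pic":   dict(zip(AXES, "MLLLHL")),
    "VideoPhy-2":   dict(zip(AXES, "LLLMLH")),
}


def strong_axes(name, resources=RESOURCES):
    """Axes on which a resource is rated 'H' (assumed to mean strong coverage)."""
    return {axis for axis, level in resources[name].items() if level == "H"}


def uncovered_axes(selected, resources=RESOURCES):
    """Axes for which no selected resource is rated 'H'."""
    covered = set()
    for name in selected:
        covered |= strong_axes(name, resources)
    return set(AXES) - covered


def greedy_cover(resources=RESOURCES):
    """Greedy set cover: repeatedly add the resource that covers the most remaining axes."""
    remaining = set(AXES)
    chosen = []
    while remaining:
        best = max(resources, key=lambda n: len(strong_axes(n, resources) & remaining))
        gain = strong_axes(best, resources) & remaining
        if not gain:  # no resource adds coverage; stop with a partial suite
            break
        chosen.append(best)
        remaining -= gain
    return chosen, remaining


if __name__ == "__main__":
    suite, gaps = greedy_cover()
    print("suite:", suite)
    print("axes still lacking an 'H' resource:", sorted(gaps))
```

Under these assumptions, no single row covers more than two axes at "H", which is consistent with the blind spots listed in the last column: a multi-axis consistency claim needs several resources reported side by side rather than one aggregate score.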
6. Open Problems and Future Directions
6.1. Conflict-Aware Multi-Target Optimization
6.2. Decomposed and Interpretable Evaluation
6.3. Persistent Shared State and User Memory
6.4. World-Grounded Physical and Causal Consistency
6.5. Composable Mechanisms
7. Conclusion
Supplementary Materials
References
- Ho, J.; Jain, A.; Abbeel, P. Denoising Diffusion Probabilistic Models. In Proceedings of the NeurIPS; Larochelle, H.; Ranzato, M.; Hadsell, R.; Balcan, M.; Lin, H., Eds., 2020.
- Song, J.; Meng, C.; Ermon, S. Denoising Diffusion Implicit Models. In Proceedings of the ICLR. OpenReview.net, 2021.
- Song, Y.; Sohl-Dickstein, J.; Kingma, D.P.; Kumar, A.; Ermon, S.; Poole, B. Score-Based Generative Modeling through Stochastic Differential Equations. In Proceedings of the ICLR. OpenReview.net, 2021.
- Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-Resolution Image Synthesis with Latent Diffusion Models. In Proceedings of the CVPR. IEEE, 2022, pp. 10674–10685. [CrossRef]
- Podell, D.; English, Z.; Lacey, K.; Blattmann, A.; Dockhorn, T.; Müller, J.; Penna, J.; Rombach, R. SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis. In Proceedings of the ICLR. OpenReview.net, 2024.
- Black Forest Labs; Batifol, S.; Blattmann, A.; Boesel, F.; Consul, S.; Diagne, C.; Dockhorn, T.; English, J.; English, Z.; Esser, P.; et al. FLUX.1 Kontext: Flow Matching for In-Context Image Generation and Editing in Latent Space. CoRR 2025, abs/2506.15742, [2506.15742]. [CrossRef]
- Wu, C.; Li, J.; Zhou, J.; Lin, J.; Gao, K.; Yan, K.; Yin, S.; Bai, S.; Xu, X.; Chen, Y.; et al. Qwen-Image Technical Report. CoRR 2025, abs/2508.02324, [2508.02324]. [CrossRef]
- Huang, Y.; Huang, J.; Liu, Y.; Yan, M.; Lv, J.; Liu, J.; Xiong, W.; Zhang, H.; Cao, L.; Chen, S. Diffusion Model-Based Image Editing: A Survey. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 47, 4409–4437. [CrossRef]
- Zhang, X.; Wei, X.; Hu, W.; Wu, J.; Wu, J.; Zhang, W.; Zhang, Z.; Lei, Z.; Li, Q. A Survey on Personalized Content Synthesis with Diffusion Models. Mach. Intell. Res. 2025, 22, 817–848. [CrossRef]
- Gao, Y.; Guo, H.; Hoang, T.; Huang, W.; Jiang, L.; Kong, F.; Li, H.; Li, J.; Li, L.; Li, X.; et al. Seedance 1.0: Exploring the Boundaries of Video Generation Models. CoRR 2025, abs/2506.09113, [2506.09113]. [CrossRef]
- Wang, A.; Ai, B.; Wen, B.; Mao, C.; Xie, C.; Chen, D.; Yu, F.; Zhao, H.; Yang, J.; Zeng, J.; et al. Wan: Open and Advanced Large-Scale Video Generative Models. CoRR 2025, abs/2503.20314, [2503.20314]. [CrossRef]
- Xiang, J.; Lv, Z.; Xu, S.; Deng, Y.; Wang, R.; Zhang, B.; Chen, D.; Tong, X.; Yang, J. Structured 3D Latents for Scalable and Versatile 3D Generation. In Proceedings of the CVPR. Computer Vision Foundation / IEEE, 2025, pp. 21469–21480. [CrossRef]
- Zhao, Z.; Lai, Z.; Lin, Q.; Zhao, Y.; Liu, H.; Yang, S.; Feng, Y.; Yang, M.; Zhang, S.; Yang, X.; et al. Hunyuan3D 2.0: Scaling Diffusion Models for High Resolution Textured 3D Assets Generation. CoRR 2025, abs/2501.12202, [2501.12202]. [CrossRef]
- Bruce, J.; Dennis, M.D.; Edwards, A.; Parker-Holder, J.; Shi, Y.; Hughes, E.; Lai, M.; Mavalankar, A.; Steigerwald, R.; Apps, C.; et al. Genie: Generative Interactive Environments. In Proceedings of the ICML; Salakhutdinov, R.; Kolter, Z.; Heller, K.A.; Weller, A.; Oliver, N.; Scarlett, J.; Berkenkamp, F., Eds. PMLR / OpenReview.net, 2024, Proceedings of Machine Learning Research, pp. 4603–4623.
- HunyuanWorld Team; Wang, Z.; Liu, Y.; Wu, J.; Gu, Z.; Wang, H.; Zuo, X.; Huang, T.; Li, W.; Zhang, S.; et al. HunyuanWorld 1.0: Generating Immersive, Explorable, and Interactive 3D Worlds from Words or Pixels. CoRR 2025, abs/2507.21809, [2507.21809]. [CrossRef]
- Zhang, C.; Zhang, C.; Zhang, M.; Kweon, I.S. Text-to-image Diffusion Models in Generative AI: A Survey. CoRR 2023, abs/2303.07909, [2303.07909]. [CrossRef]
- Cao, P.; Zhou, F.; Song, Q.; Yang, L. Controllable Generation With Text-to-Image Diffusion Models: A Survey. IEEE Trans. Pattern Anal. Mach. Intell. 2026, 48, 4771–4791. [CrossRef]
- Wang, Y.; Liu, X.; Pang, W.; Ma, L.; Yuan, S.; Debevec, P.E.; Yu, N. Survey of Video Diffusion Models: Foundations, Implementations, and Applications. Trans. Mach. Learn. Res. 2025, 2025.
- Elmoghany, M.; Rossi, R.A.; Yoon, S.; Mukherjee, S.; Bakr, E.M.; Mathur, P.; Wu, G.; Lai, V.D.; Lipka, N.; Zhang, R.; et al. A Survey on Long-Video Storytelling Generation: Architectures, Consistency, and Cinematic Quality. In Proceedings of the ICCV Workshops. IEEE, 2025, pp. 7082–7094. [CrossRef]
- Liu, B.; Shao, S.; Li, B.; Bai, L.; Xu, Z.; Xiong, H.; Kwok, J.T.; Helal, S.; Xie, Z. Alignment of Diffusion Models: Fundamentals, Challenges, and Future. ACM Comput. Surv. 2026, 58, 244:1–244:37. [CrossRef]
- Wang, Z.; Li, D.; Jiang, R. Diffusion Models in 3D Vision: A Survey. CoRR 2024, abs/2410.04738, [2410.04738]. [CrossRef]
- Miao, Q.; Li, K.; Quan, J.; Min, Z.; Ma, S.; Xu, Y.; Yang, Y.; Luo, Y. Advances in 4D Generation: A Survey. CoRR 2025, abs/2503.14501, [2503.14501]. [CrossRef]
- Liu, D.; Zhang, J.; Dinh, A.; Park, E.; Zhang, S.; Xu, C. Generative Physical AI in Vision: A Survey. CoRR 2025, abs/2501.10928, [2501.10928]. [CrossRef]
- Hartwig, S.; Engel, D.; Sick, L.; Kniesel, H.; Payer, T.; Poonam, P.; Glöckler, M.; Bäuerle, A.; Ropinski, T. A Survey on Quality Metrics for Text-to-Image Generation. IEEE Trans. Vis. Comput. Graph. 2025, 31, 9464–9483. [CrossRef]
- Xu, Y.; Zhang, J.; Salemi, A.; Hu, X.; Wang, W.; Feng, F.; Zamani, H.; He, X.; Chua, T. Personalized Generation In Large Model Era: A Survey. In Proceedings of the ACL; Che, W.; Nabende, J.; Shutova, E.; Pilehvar, M.T., Eds. Association for Computational Linguistics, 2025, pp. 24607–24649.
- Wei, Y.; Zheng, Y.; Zhang, Y.; Liu, M.; Ji, Z.; Zhang, L.; Zuo, W. Personalized Image Generation with Deep Generative Models: A Decade Survey. Comput. Vis. Media 2025, 11, 1141–1194. [CrossRef]
- Xing, Z.; Feng, Q.; Chen, H.; Dai, Q.; Hu, H.; Xu, H.; Wu, Z.; Jiang, Y. A Survey on Video Diffusion Models. ACM Comput. Surv. 2025, 57, 41:1–41:42. [CrossRef]
- Zhang, Y.; Chen, Z.; Cheng, C.; Ruan, W.; Huang, X.; Zhao, D.; Flynn, D.; Khastgir, S.; Zhao, X. Trustworthy text-to-image diffusion models: A timely and focused survey. Inf. Fusion 2026, 133, 104264. [CrossRef]
- Truong, V.T.; Dang, B.L.; Le, L.B. Attacks and Defenses for Generative Diffusion Models: A Comprehensive Survey. ACM Comput. Surv. 2025, 57, 216:1–216:44. [CrossRef]
- Li, X.; He, X.; Zhang, L.; Liu, Y. A Comprehensive Survey on World Models for Embodied AI. CoRR 2025, abs/2510.16732, [2510.16732]. [CrossRef]
- Chefer, H.; Alaluf, Y.; Vinker, Y.; Wolf, L.; Cohen-Or, D. Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models. ACM Trans. Graph. 2023, 42, 148:1–148:10. [CrossRef]
- Zhang, L.; Rao, A.; Agrawala, M. Adding Conditional Control to Text-to-Image Diffusion Models. In Proceedings of the ICCV. IEEE, 2023, pp. 3813–3824. [CrossRef]
- Brooks, T.; Holynski, A.; Efros, A.A. InstructPix2Pix: Learning to Follow Image Editing Instructions. In Proceedings of the CVPR. IEEE, 2023, pp. 18392–18402. [CrossRef]
- Ruiz, N.; Li, Y.; Jampani, V.; Pritch, Y.; Rubinstein, M.; Aberman, K. DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation. In Proceedings of the CVPR. IEEE, 2023, pp. 22500–22510. [CrossRef]
- Shi, Y.; Wang, P.; Ye, J.; Mai, L.; Li, K.; Yang, X. MVDream: Multi-view Diffusion for 3D Generation. In Proceedings of the ICLR. OpenReview.net, 2024.
- Guo, Y.; Yang, C.; Rao, A.; Liang, Z.; Wang, Y.; Qiao, Y.; Agrawala, M.; Lin, D.; Dai, B. AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning. In Proceedings of the ICLR. OpenReview.net, 2024.
- Wallace, B.; Dang, M.; Rafailov, R.; Zhou, L.; Lou, A.; Purushwalkam, S.; Ermon, S.; Xiong, C.; Joty, S.; Naik, N. Diffusion Model Alignment Using Direct Preference Optimization. In Proceedings of the CVPR. IEEE, 2024, pp. 8228–8238. [CrossRef]
- Schramowski, P.; Brack, M.; Deiseroth, B.; Kersting, K. Safe Latent Diffusion: Mitigating Inappropriate Degeneration in Diffusion Models. In Proceedings of the CVPR. IEEE, 2023, pp. 22522–22531. [CrossRef]
- Bansal, H.; Lin, Z.; Xie, T.; Zong, Z.; Yarom, M.; Bitton, Y.; Jiang, C.; Sun, Y.; Chang, K.; Grover, A. VideoPhy: Evaluating Physical Commonsense for Video Generation. In Proceedings of the ICLR. OpenReview.net, 2025.
- Song, Y.; Dhariwal, P.; Chen, M.; Sutskever, I. Consistency Models. In Proceedings of the ICML; Krause, A.; Brunskill, E.; Cho, K.; Engelhardt, B.; Sabato, S.; Scarlett, J., Eds. PMLR, 2023, Proceedings of Machine Learning Research, pp. 32211–32252.
- Ghosh, D.; Hajishirzi, H.; Schmidt, L. GenEval: An object-focused framework for evaluating text-to-image alignment. In Proceedings of the NeurIPS; Oh, A.; Naumann, T.; Globerson, A.; Saenko, K.; Hardt, M.; Levine, S., Eds., 2023.
- Huang, K.; Sun, K.; Xie, E.; Li, Z.; Liu, X. T2I-CompBench: A Comprehensive Benchmark for Open-world Compositional Text-to-image Generation. In Proceedings of the NeurIPS; Oh, A.; Naumann, T.; Globerson, A.; Saenko, K.; Hardt, M.; Levine, S., Eds., 2023.
- Hertz, A.; Mokady, R.; Tenenbaum, J.; Aberman, K.; Pritch, Y.; Cohen-Or, D. Prompt-to-Prompt Image Editing with Cross-Attention Control. In Proceedings of the ICLR. OpenReview.net, 2023.
- Kim, G.; Park, H.; Kim, T. Preserve and Personalize: Personalized Text-to-Image Diffusion Models without Distributional Drift. In Proceedings of the ICLR, 2026.
- Blattmann, A.; Rombach, R.; Ling, H.; Dockhorn, T.; Kim, S.W.; Fidler, S.; Kreis, K. Align Your Latents: High-Resolution Video Synthesis with Latent Diffusion Models. In Proceedings of the CVPR. IEEE, 2023, pp. 22563–22575. [CrossRef]
- Rahman, T.; Lee, H.; Ren, J.; Tulyakov, S.; Mahajan, S.; Sigal, L. Make-A-Story: Visual Memory Conditioned Consistent Story Generation. In Proceedings of the CVPR. IEEE, 2023, pp. 2493–2502.
- Zhou, Y.; Zhou, D.; Cheng, M.; Feng, J.; Hou, Q. StoryDiffusion: Consistent Self-Attention for Long-Range Image and Video Generation. In Proceedings of the NeurIPS; Globerson, A.; Mackey, L.; Belgrave, D.; Fan, A.; Paquet, U.; Tomczak, J.M.; Zhang, C., Eds., 2024.
- Li, Y.; Liu, H.; Wu, Q.; Mu, F.; Yang, J.; Gao, J.; Li, C.; Lee, Y.J. GLIGEN: Open-Set Grounded Text-to-Image Generation. In Proceedings of the CVPR. IEEE, 2023, pp. 22511–22521. [CrossRef]
- Shin, J.; Li, Z.; Zhang, R.; Zhu, J.; Park, J.; Shechtman, E.; Huang, X. MotionStream: Real-Time Video Generation with Interactive Motion Controls. CoRR 2025, abs/2511.01266, [2511.01266]. [CrossRef]
- Hu, Y.; Liu, B.; Kasai, J.; Wang, Y.; Ostendorf, M.; Krishna, R.; Smith, N.A. TIFA: Accurate and Interpretable Text-to-Image Faithfulness Evaluation with Question Answering. In Proceedings of the ICCV. IEEE, 2023, pp. 20349–20360. [CrossRef]
- Asim, M.; Wewer, C.; Wimmer, T.; Schiele, B.; Lenssen, J.E. MET3R: Measuring Multi-View Consistency in Generated Images. In Proceedings of the CVPR. Computer Vision Foundation / IEEE, 2025, pp. 6034–6044. [CrossRef]
- Li, Z.; Cao, M.; Wang, X.; Qi, Z.; Cheng, M.; Shan, Y. PhotoMaker: Customizing Realistic Human Photos via Stacked ID Embedding. In Proceedings of the CVPR. IEEE, 2024, pp. 8640–8650. [CrossRef]
- Xu, J.; Liu, X.; Wu, Y.; Tong, Y.; Li, Q.; Ding, M.; Tang, J.; Dong, Y. ImageReward: Learning and Evaluating Human Preferences for Text-to-Image Generation. In Proceedings of the NeurIPS; Oh, A.; Naumann, T.; Globerson, A.; Saenko, K.; Hardt, M.; Levine, S., Eds., 2023.
- Meng, F.; Liao, J.; Tan, X.; Lu, Q.; Shao, W.; Zhang, K.; Cheng, Y.; Li, D.; Luo, P. Towards World Simulator: Crafting Physical Commonsense-Based Benchmark for Video Generation. In Proceedings of the ICML; Singh, A.; Fazel, M.; Hsu, D.; Lacoste-Julien, S.; Berkenkamp, F.; Maharaj, T.; Wagstaff, K.; Zhu, J., Eds. PMLR / OpenReview.net, 2025, Proceedings of Machine Learning Research.
- Lin, H.; Cho, J.; Zala, A.; Bansal, M. Ctrl-Adapter: An Efficient and Versatile Framework for Adapting Diverse Controls to Any Diffusion Model. In Proceedings of the ICLR. OpenReview.net, 2025.
- Guo, X.; Ma, X.; Zhang, H.; Huang, D. CTCal: Rethinking Text-to-Image Diffusion Models via Cross-Timestep Self-Calibration. CoRR 2026, abs/2603.20741, [2603.20741]. [CrossRef]
- Lee, K.; Li, X.; Wang, Q.; He, J.; Ke, J.; Yang, M.; Essa, I.; Shin, J.; Yang, F.; Li, Y. Calibrated Multi-Preference Optimization for Aligning Diffusion Models. In Proceedings of the CVPR. Computer Vision Foundation / IEEE, 2025, pp. 18465–18475. [CrossRef]
- Xie, J.; Li, Y.; Huang, Y.; Liu, H.; Zhang, W.; Zheng, Y.; Shou, M.Z. BoxDiff: Text-to-Image Synthesis with Training-Free Box-Constrained Diffusion. In Proceedings of the ICCV. IEEE, 2023, pp. 7418–7427. [CrossRef]
- Huang, L.; Chen, D.; Liu, Y.; Shen, Y.; Zhao, D.; Zhou, J. Composer: Creative and Controllable Image Synthesis with Composable Conditions. In Proceedings of the ICML; Krause, A.; Brunskill, E.; Cho, K.; Engelhardt, B.; Sabato, S.; Scarlett, J., Eds. PMLR, 2023, Proceedings of Machine Learning Research, pp. 13753–13773.
- Binyamin, L.; Tewel, Y.; Segev, H.; Hirsch, E.; Rassin, R.; Chechik, G. Make It Count: Text-to-Image Generation with an Accurate Number of Objects. In Proceedings of the CVPR. Computer Vision Foundation / IEEE, 2025, pp. 13242–13251. [CrossRef]
- Li, Y.; Wan, P.; Han, L.; Wang, Y.; Nie, L.; Zhang, M. CountDiffusion: Text-to-Image Synthesis with Training-Free Counting-Guidance Diffusion. CoRR 2025, abs/2505.04347, [2505.04347]. [CrossRef]
- Zeng, G.; Zhang, X.; Wang, Z.; Xu, H.; Chen, Z.; Li, B.; Tu, Z. YOLO-Count: Differentiable Object Counting for Text-to-Image Generation. CoRR 2025, abs/2508.00728, [2508.00728]. [CrossRef]
- Mou, C.; Wang, X.; Xie, L.; Wu, Y.; Zhang, J.; Qi, Z.; Shan, Y. T2I-Adapter: Learning Adapters to Dig Out More Controllable Ability for Text-to-Image Diffusion Models. In Proceedings of the AAAI; Wooldridge, M.J.; Dy, J.G.; Natarajan, S., Eds. AAAI Press, 2024, pp. 4296–4304. [CrossRef]
- Ye, H.; Zhang, J.; Liu, S.; Han, X.; Yang, W. IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models. CoRR 2023, abs/2308.06721, [2308.06721]. [CrossRef]
- Chen, X.; Huang, L.; Liu, Y.; Shen, Y.; Zhao, D.; Zhao, H. AnyDoor: Zero-shot Object-level Image Customization. In Proceedings of the CVPR. IEEE, 2024, pp. 6593–6602. [CrossRef]
- Yu, J.; Wang, Y.; Zhao, C.; Ghanem, B.; Zhang, J. FreeDoM: Training-Free Energy-Guided Conditional Diffusion Model. In Proceedings of the ICCV. IEEE, 2023, pp. 23117–23127. [CrossRef]
- Ju, X.; Zeng, A.; Zhao, C.; Wang, J.; Zhang, L.; Xu, Q. HumanSD: A Native Skeleton-Guided Diffusion Model for Human Image Generation. In Proceedings of the ICCV. IEEE, 2023, pp. 15942–15952. [CrossRef]
- Joung, W.; Chae, D.; Kim, J. SemanticControl: A Training-Free Approach for Handling Loosely Aligned Visual Conditions in ControlNet. CoRR 2025, abs/2509.21938, [2509.21938]. [CrossRef]
- Couairon, G.; Verbeek, J.; Schwenk, H.; Cord, M. DiffEdit: Diffusion-based semantic image editing with mask guidance. In Proceedings of the ICLR. OpenReview.net, 2023.
- Geng, Z.; Yang, B.; Hang, T.; Li, C.; Gu, S.; Zhang, T.; Bao, J.; Zhang, Z.; Li, H.; Hu, H.; et al. InstructDiffusion: A Generalist Modeling Interface for Vision Tasks. In Proceedings of the CVPR. IEEE, 2024, pp. 12709–12720. [CrossRef]
- Mokady, R.; Hertz, A.; Aberman, K.; Pritch, Y.; Cohen-Or, D. Null-text Inversion for Editing Real Images using Guided Diffusion Models. In Proceedings of the CVPR. IEEE, 2023, pp. 6038–6047. [CrossRef]
- Kawar, B.; Zada, S.; Lang, O.; Tov, O.; Chang, H.; Dekel, T.; Mosseri, I.; Irani, M. Imagic: Text-Based Real Image Editing with Diffusion Models. In Proceedings of the CVPR. IEEE, 2023, pp. 6007–6017. [CrossRef]
- Yang, B.; Gu, S.; Zhang, B.; Zhang, T.; Chen, X.; Sun, X.; Chen, D.; Wen, F. Paint by Example: Exemplar-based Image Editing with Diffusion Models. In Proceedings of the CVPR. IEEE, 2023, pp. 18381–18391. [CrossRef]
- Kumari, N.; Zhang, B.; Zhang, R.; Shechtman, E.; Zhu, J. Multi-Concept Customization of Text-to-Image Diffusion. In Proceedings of the CVPR. IEEE, 2023, pp. 1931–1941. [CrossRef]
- Li, D.; Li, J.; Hoi, S.C.H. BLIP-Diffusion: Pre-trained Subject Representation for Controllable Text-to-Image Generation and Editing. In Proceedings of the NeurIPS; Oh, A.; Naumann, T.; Globerson, A.; Saenko, K.; Hardt, M.; Levine, S., Eds., 2023.
- Wang, Q.; Bai, X.; Wang, H.; Qin, Z.; Chen, A. InstantID: Zero-shot Identity-Preserving Generation in Seconds. CoRR 2024, abs/2401.07519, [2401.07519]. [CrossRef]
- Tewel, Y.; Kaduri, O.; Gal, R.; Kasten, Y.; Wolf, L.; Chechik, G.; Atzmon, Y. Training-Free Consistent Text-to-Image Generation. ACM Trans. Graph. 2024, 43, 52:1–52:18. [CrossRef]
- Liu, R.; Wu, R.; Hoorick, B.V.; Tokmakov, P.; Zakharov, S.; Vondrick, C. Zero-1-to-3: Zero-shot One Image to 3D Object. In Proceedings of the ICCV. IEEE, 2023, pp. 9264–9275. [CrossRef]
- Chen, Y.; Fang, J.; Huang, Y.; Yi, T.; Zhang, X.; Xie, L.; Wang, X.; Dai, W.; Xiong, H.; Tian, Q. Cascade-Zero123: One Image to Highly Consistent 3D with Self-prompted Nearby Views. In Proceedings of the ECCV; Leonardis, A.; Ricci, E.; Roth, S.; Russakovsky, O.; Sattler, T.; Varol, G., Eds. Springer, 2024, Lecture Notes in Computer Science, pp. 311–330. [CrossRef]
- Liu, Y.; Lin, C.; Zeng, Z.; Long, X.; Liu, L.; Komura, T.; Wang, W. SyncDreamer: Generating Multiview-consistent Images from a Single-view Image. In Proceedings of the ICLR. OpenReview.net, 2024.
- Long, X.; Guo, Y.; Lin, C.; Liu, Y.; Dou, Z.; Liu, L.; Ma, Y.; Zhang, S.; Habermann, M.; Theobalt, C.; et al. Wonder3D: Single Image to 3D Using Cross-Domain Diffusion. In Proceedings of the CVPR. IEEE, 2024, pp. 9970–9980. [CrossRef]
- Höllein, L.; Bozic, A.; Müller, N.; Novotný, D.; Tseng, H.; Richardt, C.; Zollhöfer, M.; Nießner, M. ViewDiff: 3D-Consistent Image Generation with Text-to-Image Models. In Proceedings of the CVPR. IEEE, 2024, pp. 5043–5052. [CrossRef]
- Kong, X.; Liu, S.; Lyu, X.; Taher, M.; Qi, X.; Davison, A.J. EscherNet: A Generative Model for Scalable View Synthesis. In Proceedings of the CVPR. IEEE, 2024, pp. 9503–9513. [CrossRef]
- Xie, X.; Zou, C.; Karumuri, M.G.; Lenssen, J.E.; Pons-Moll, G. MVGBench: Comprehensive Benchmark for Multi-view Generation Models. CoRR 2025, abs/2507.00006, [2507.00006]. [CrossRef]
- Qi, C.; Cun, X.; Zhang, Y.; Lei, C.; Wang, X.; Shan, Y.; Chen, Q. FateZero: Fusing Attentions for Zero-shot Text-based Video Editing. In Proceedings of the ICCV. IEEE, 2023, pp. 15886–15896. [CrossRef]
- Liu, S.; Zhang, Y.; Li, W.; Lin, Z.; Jia, J. Video-P2P: Video Editing with Cross-Attention Control. In Proceedings of the CVPR. IEEE, 2024, pp. 8599–8608. [CrossRef]
- Gong, Y.; Pang, Y.; Cun, X.; Xia, M.; He, Y.; Chen, H.; Wang, L.; Zhang, Y.; Wang, X.; Shan, Y.; et al. TaleCrafter: Interactive Story Visualization with Multiple Characters. CoRR 2023, abs/2305.18247, [2305.18247]. [CrossRef]
- Liu, T.; Wang, K.; Li, S.; van de Weijer, J.; Khan, F.S.; Yang, S.; Wang, Y.; Yang, J.; Cheng, M. One-Prompt-One-Story: Free-Lunch Consistent Text-to-Image Generation Using a Single Prompt. In Proceedings of the ICLR. OpenReview.net, 2025.
- Zhao, C.; Liu, M.; Wang, W.; Chen, W.; Wang, F.; Chen, H.; Zhang, B.; Shen, C. MovieDreamer: Hierarchical Generation for Coherent Long Visual Sequences. In Proceedings of the ICLR. OpenReview.net, 2025.
- Wu, X.; Sun, K.; Zhu, F.; Zhao, R.; Li, H. Human Preference Score: Better Aligning Text-to-image Models with Human Preference. In Proceedings of the ICCV. IEEE, 2023, pp. 2096–2105. [CrossRef]
- Zhang, S.; Wang, B.; Wu, J.; Li, Y.; Gao, T.; Zhang, D.; Wang, Z. Learning Multi-Dimensional Human Preference for Text-to-Image Generation. In Proceedings of the CVPR. IEEE, 2024, pp. 8018–8027. [CrossRef]
- Xu, J.; Huang, Y.; Cheng, J.; Yang, Y.; Xu, J.; Wang, Y.; Duan, W.; Yang, S.; Jin, Q.; Li, S.; et al. VisionReward: Fine-Grained Multi-Dimensional Human Preference Learning for Image and Video Generation. In Proceedings of the AAAI; Koenig, S.; Jenkins, C.; Taylor, M.E., Eds. AAAI Press, 2026, pp. 11269–11277. [CrossRef]
- Kirstain, Y.; Polyak, A.; Singer, U.; Matiana, S.; Penna, J.; Levy, O. Pick-a-Pic: An Open Dataset of User Preferences for Text-to-Image Generation. In Proceedings of the NeurIPS; Oh, A.; Naumann, T.; Globerson, A.; Saenko, K.; Hardt, M.; Levine, S., Eds., 2023.
- Liang, Z.; Yuan, Y.; Gu, S.; Chen, B.; Hang, T.; Cheng, M.; Li, J.; Zheng, L. Aesthetic Post-Training Diffusion Models from Generic Preferences with Step-by-step Preference Optimization. In Proceedings of the CVPR. Computer Vision Foundation / IEEE, 2025, pp. 13199–13208. [CrossRef]
- Liu, J.; Liu, G.; Liang, J.; Li, Y.; Liu, J.; Wang, X.; Wan, P.; Zhang, D.; Ouyang, W. Flow-GRPO: Training Flow Matching Models via Online RL. CoRR 2025, abs/2505.05470, [2505.05470]. [CrossRef]
- Zheng, K.; Chen, H.; Ye, H.; Wang, H.; Zhang, Q.; Jiang, K.; Su, H.; Ermon, S.; Zhu, J.; Liu, M. DiffusionNFT: Online Diffusion Reinforcement with Forward Process. CoRR 2025, abs/2509.16117, [2509.16117]. [CrossRef]
- Gandikota, R.; Orgad, H.; Belinkov, Y.; Materzynska, J.; Bau, D. Unified Concept Editing in Diffusion Models. In Proceedings of the WACV. IEEE, 2024, pp. 5099–5108. [CrossRef]
- Wang, Z.; Wei, Y.; Li, F.; Pei, R.; Xu, H.; Zuo, W. ACE: Anti-Editing Concept Erasure in Text-to-Image Models. In Proceedings of the CVPR. Computer Vision Foundation / IEEE, 2025, pp. 23505–23515. [CrossRef]
- Gandikota, R.; Materzynska, J.; Fiotto-Kaufman, J.; Bau, D. Erasing Concepts from Diffusion Models. In Proceedings of the ICCV. IEEE, 2023, pp. 2426–2436. [CrossRef]
- Xiong, T.; Wu, Y.; Xie, E.; Wu, Y.; Li, Z.; Liu, X. Editing Massive Concepts in Text-to-Image Diffusion Models. CoRR 2024, abs/2403.13807, [2403.13807]. [CrossRef]
- Ren, J.; Chen, K.; Cui, Y.; Zeng, S.; Liu, H.; Xing, Y.; Tang, J.; Lyu, L. Six-CD: Benchmarking Concept Removals for Benign Text-to-image Diffusion Models. CoRR 2024, abs/2406.14855, [2406.14855]. [CrossRef]
- Meng, F.; Shao, W.; Luo, L.; Wang, Y.; Chen, Y.; Lu, Q.; Yang, Y.; Yang, T.; Zhang, K.; Qiao, Y.; et al. PhyBench: A Physical Commonsense Benchmark for Evaluating Text-to-Image Models. CoRR 2024, abs/2406.11802, [2406.11802]. [CrossRef]
- Bansal, H.; Peng, C.; Bitton, Y.; Goldenberg, R.; Grover, A.; Chang, K. VideoPhy-2: A Challenging Action-Centric Physical Commonsense Evaluation in Video Generation. CoRR 2025, abs/2503.06800, [2503.06800]. [CrossRef]
- Chen, Y.; Guo, X.; Shi, Z.; Song, Z.; Zhang, J. T2VWorldBench: A Benchmark for Evaluating World Knowledge in Text-to-Video Generation. CoRR 2025, abs/2507.18107, [2507.18107]. [CrossRef]
- Gu, J.; Liu, X.; Zeng, Y.; Nagarajan, A.; Zhu, F.; Hong, D.; Fan, Y.; Yan, Q.; Zhou, K.; Liu, M.; et al. PhyWorldBench: A Comprehensive Evaluation of Physical Realism in Text-to-Video Models. CoRR 2025, abs/2507.13428, [2507.13428]. [CrossRef]
- Motamed, S.; Culp, L.; Swersky, K.; Jaini, P.; Geirhos, R. Do generative video models learn physical principles from watching videos? CoRR 2025, abs/2501.09038, [2501.09038]. [CrossRef]
- Han, X.; Zhu, B.; Hu, S.; Li, F.M.; Carrington, P.; Zimmermann, R.; Chen, J. OSCBench: Benchmarking Object State Change in Text-to-Video Generation. CoRR 2026, abs/2603.11698, [2603.11698]. [CrossRef]
- Choi, Y.; Park, C.; Baek, S.J. DynASyn: Multi-Subject Personalization Enabling Dynamic Action Synthesis. In Proceedings of the AAAI; Walsh, T.; Shah, J.; Kolter, Z., Eds. AAAI Press, 2025, pp. 2564–2572. [CrossRef]
- Kamath, A.; Chang, K.; Krishna, R.; Zettlemoyer, L.; Hu, Y.; Ghazvininejad, M. GenEval 2: Addressing Benchmark Drift in Text-to-Image Evaluation. CoRR 2025, abs/2512.16853, [2512.16853]. [CrossRef]
- Nichol, A.Q.; Dhariwal, P.; Ramesh, A.; Shyam, P.; Mishkin, P.; McGrew, B.; Sutskever, I.; Chen, M. GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models. In Proceedings of the ICML; Chaudhuri, K.; Jegelka, S.; Song, L.; Szepesvári, C.; Niu, G.; Sabato, S., Eds. PMLR, 2022, Proceedings of Machine Learning Research, pp. 16784–16804.
- Saharia, C.; Chan, W.; Saxena, S.; Li, L.; Whang, J.; Denton, E.L.; Ghasemipour, S.K.S.; Lopes, R.G.; Ayan, B.K.; Salimans, T.; et al. Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding. In Proceedings of the NeurIPS; Koyejo, S.; Mohamed, S.; Agarwal, A.; Belgrave, D.; Cho, K.; Oh, A., Eds., 2022.
- Balaji, Y.; Nah, S.; Huang, X.; Vahdat, A.; Song, J.; Kreis, K.; Aittala, M.; Aila, T.; Laine, S.; Catanzaro, B.; et al. eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers. CoRR 2022, abs/2211.01324, [2211.01324]. [CrossRef]
- Brack, M.; Friedrich, F.; Hintersdorf, D.; Struppek, L.; Schramowski, P.; Kersting, K. SEGA: Instructing Diffusion using Semantic Dimensions. CoRR 2023, abs/2301.12247, [2301.12247]. [CrossRef]
- Lee, J.; Lee, J.; Lee, J. CountCluster: Training-Free Object Quantity Guidance with Cross-Attention Map Clustering for Text-to-Image Generation. CoRR 2025, abs/2508.10710, [2508.10710]. [CrossRef]
- Wang, Q.; Deng, H.; Qi, Y.; Li, D.; Song, Y. SketchKnitter: Vectorized Sketch Generation with Diffusion Models. In Proceedings of the ICLR. OpenReview.net, 2023.
- Inoue, N.; Kikuchi, K.; Simo-Serra, E.; Otani, M.; Yamaguchi, K. LayoutDM: Discrete Diffusion Model for Controllable Layout Generation. In Proceedings of the CVPR. IEEE, 2023, pp. 10167–10176. [CrossRef]
- Weng, H.; Huang, D.; Qiao, Y.; Hu, Z.; Lin, C.; Zhang, T.; Chen, C.L.P. Desigen: A Pipeline for Controllable Design Template Generation. In Proceedings of the CVPR. IEEE, 2024, pp. 12721–12732. [CrossRef]
- Zhang, J.; Guo, J.; Sun, S.; Lou, J.; Zhang, D. LayoutDiffusion: Improving Graphic Layout Generation by Discrete Diffusion Probabilistic Models. In Proceedings of the ICCV. IEEE, 2023, pp. 7192–7202. [CrossRef]
- Hui, M.; Zhang, Z.; Zhang, X.; Xie, W.; Wang, Y.; Lu, Y. Unifying Layout Generation with a Decoupled Diffusion Model. In Proceedings of the CVPR. IEEE, 2023, pp. 1942–1951. [CrossRef]
- Li, L.H.; Zhang, P.; Zhang, H.; Yang, J.; Li, C.; Zhong, Y.; Wang, L.; Yuan, L.; Zhang, L.; Hwang, J.; et al. Grounded Language-Image Pre-training. In Proceedings of the CVPR. IEEE, 2022, pp. 10955–10965. [CrossRef]
- Tumanyan, N.; Geyer, M.; Bagon, S.; Dekel, T. Plug-and-Play Diffusion Features for Text-Driven Image-to-Image Translation. In Proceedings of the CVPR. IEEE, 2023, pp. 1921–1930. [CrossRef]
- Zhang, Z.; Han, L.; Ghosh, A.; Metaxas, D.N.; Ren, J. SINE: SINgle Image Editing with Text-to-Image Diffusion Models. In Proceedings of the CVPR. IEEE, 2023, pp. 6027–6037. [CrossRef]
- Goel, V.; Peruzzo, E.; Jiang, Y.; Xu, D.; Sebe, N.; Darrell, T.; Wang, Z.; Shi, H. PAIR-Diffusion: Object-Level Image Editing with Structure-and-Appearance Paired Diffusion Models. CoRR 2023, abs/2303.17546, [2303.17546]. [CrossRef]
- Huang, Y.; Xie, L.; Wang, X.; Yuan, Z.; Cun, X.; Ge, Y.; Zhou, J.; Dong, C.; Huang, R.; Zhang, R.; et al. SmartEdit: Exploring Complex Instruction-Based Image Editing with Multimodal Large Language Models. In Proceedings of the CVPR. IEEE, 2024, pp. 8362–8371. [CrossRef]
- Deutch, G.; Gal, R.; Garibi, D.; Patashnik, O.; Cohen-Or, D. TurboEdit: Text-Based Image Editing Using Few-Step Diffusion Models. In Proceedings of the SIGGRAPH Asia; Igarashi, T.; Shamir, A.; Zhang, H.R., Eds. ACM, 2024, pp. 41:1–41:12. [CrossRef]
- Wei, T.; Zhou, Y.; Chen, D.; Pan, X. FreeFlux: Understanding and Exploiting Layer-Specific Roles in RoPE-Based MMDiT for Versatile Image Editing. CoRR 2025, abs/2503.16153, [2503.16153]. [CrossRef]
- Shi, Y.; Xue, C.; Liew, J.H.; Pan, J.; Yan, H.; Zhang, W.; Tan, V.Y.F.; Bai, S. DragDiffusion: Harnessing Diffusion Models for Interactive Point-Based Image Editing. In Proceedings of the CVPR. IEEE, 2024, pp. 8839–8849. [CrossRef]
- Shi, Y.; Liew, J.H.; Yan, H.; Tan, V.Y.F.; Feng, J. InstaDrag: Lightning Fast and Accurate Drag-based Image Editing Emerging from Videos. CoRR 2024, abs/2405.13722, [2405.13722]. [CrossRef]
- Liu, C.; Li, R.; Zhang, K.; Lan, Y.; Liu, D. StableV2V: Stablizing Shape Consistency in Video-to-Video Editing. CoRR 2024, abs/2411.11045, [2411.11045]. [CrossRef]
- Patel, M.; Gokhale, T.; Baral, C.; Yang, Y. ConceptBed: Evaluating Concept Learning Abilities of Text-to-Image Diffusion Models. In Proceedings of the AAAI; Wooldridge, M.J.; Dy, J.G.; Natarajan, S., Eds. AAAI Press, 2024, pp. 14554–14562. [CrossRef]
- Sharma, P.; Ding, N.; Goodman, S.; Soricut, R. Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning. In Proceedings of the ACL; Gurevych, I.; Miyao, Y., Eds. Association for Computational Linguistics, 2018, pp. 2556–2565. [CrossRef]
- Chen, S.; Lai, J.; Gao, J.; Ye, T.; Chen, H.; Shi, H.; Shao, S.; Lin, Y.; Fei, S.; Xing, Z.; et al. PosterCraft: Rethinking High-Quality Aesthetic Poster Generation in a Unified Framework. CoRR 2025, abs/2506.10741, [2506.10741]. [CrossRef]
- Zhang, Z.; Cheng, Y.; Hong, D.; Yang, M.; Shi, G.; Ma, L.; Zhang, H.; Shao, J.; Wu, X. CreatiPoster: Towards Editable and Controllable Multi-Layer Graphic Design Generation. CoRR 2025, abs/2506.10890, [2506.10890]. [CrossRef]
- Gao, Y.; Lin, Z.; Liu, C.; Zhou, M.; Ge, T.; Zheng, B.; Xie, H. PosterMaker: Towards High-Quality Product Poster Generation with Accurate Text Rendering. In Proceedings of the CVPR. Computer Vision Foundation / IEEE, 2025, pp. 8083–8093. [CrossRef]
- Zhu, L.; Yang, D.; Zhu, T.; Reda, F.; Chan, W.; Saharia, C.; Norouzi, M.; Kemelmacher-Shlizerman, I. TryOnDiffusion: A Tale of Two UNets. In Proceedings of the CVPR. IEEE, 2023, pp. 4606–4615. [CrossRef]
- Kim, J.; Gu, G.; Park, M.; Park, S.; Choo, J. StableVITON: Learning Semantic Correspondence with Latent Diffusion Model for Virtual Try-On. CoRR 2023, abs/2312.01725, [2312.01725]. [CrossRef]
- Li, X.; Sun, Q.; Zhang, P.; Ye, F.; Liao, Z.; Feng, W.; Zhao, S.; He, Q. AnyDressing: Customizable Multi-Garment Virtual Dressing via Latent Diffusion Models. In Proceedings of the CVPR. Computer Vision Foundation / IEEE, 2025, pp. 23723–23733. [CrossRef]
- Gal, R.; Alaluf, Y.; Atzmon, Y.; Patashnik, O.; Bermano, A.H.; Chechik, G.; Cohen-Or, D. An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion. In Proceedings of the ICLR. OpenReview.net, 2023.
- Tewel, Y.; Gal, R.; Chechik, G.; Atzmon, Y. Key-Locked Rank One Editing for Text-to-Image Personalization. In Proceedings of the SIGGRAPH; Brunvand, E.; Sheffer, A.; Wimmer, M., Eds. ACM, 2023, pp. 12:1–12:11. [CrossRef]
- Huang, Z.; Wu, T.; Jiang, Y.; Chan, K.C.K.; Liu, Z. ReVersion: Diffusion-Based Relation Inversion from Images. In Proceedings of the SIGGRAPH Asia; Igarashi, T.; Shamir, A.; Zhang, H.R., Eds. ACM, 2024, pp. 4:1–4:11. [CrossRef]
- Gu, J.; Wang, Y.; Zhao, N.; Fu, T.; Xiong, W.; Liu, Q.; Zhang, Z.; Zhang, H.; Zhang, J.; Jung, H.; et al. PHOTOSWAP: Personalized Subject Swapping in Images. In Proceedings of the NeurIPS; Oh, A.; Naumann, T.; Globerson, A.; Saenko, K.; Hardt, M.; Levine, S., Eds., 2023.
- Cai, S.; Chan, E.R.; Zhang, Y.; Guibas, L.J.; Wu, J.; Wetzstein, G. Diffusion Self-Distillation for Zero-Shot Customized Image Generation. In Proceedings of the CVPR. Computer Vision Foundation / IEEE, 2025, pp. 18434–18443. [CrossRef]
- Pan, X.; Dong, L.; Huang, S.; Peng, Z.; Chen, W.; Wei, F. Kosmos-G: Generating Images in Context with Multimodal Large Language Models. In Proceedings of the ICLR. OpenReview.net, 2024.
- Mou, C.; Wu, Y.; Wu, W.; Guo, Z.; Zhang, P.; Cheng, Y.; Luo, Y.; Ding, F.; Zhang, S.; Li, X.; et al. DreamO: A Unified Framework for Image Customization. In Proceedings of the SIGGRAPH Asia; Komura, T.; Wimmer, M.; Fu, H., Eds. ACM, 2025, pp. 194:1–194:12. [CrossRef]
- Zong, Z.; Jiang, D.; Ma, B.; Song, G.; Shao, H.; Shen, D.; Liu, Y.; Li, H. EasyRef: Omni-Generalized Group Image Reference for Diffusion Models via Multimodal LLM. In Proceedings of the ICML; Singh, A.; Fazel, M.; Hsu, D.; Lacoste-Julien, S.; Berkenkamp, F.; Maharaj, T.; Wagstaff, K.; Zhu, J., Eds. PMLR / OpenReview.net, 2025, Proceedings of Machine Learning Research.
- He, Q.; Yao, A. Conceptrol: Concept Control of Zero-shot Personalized Image Generation. CoRR 2025, abs/2503.06568, [2503.06568]. [CrossRef]
- Shi, R.; Chen, H.; Zhang, Z.; Liu, M.; Xu, C.; Wei, X.; Chen, L.; Zeng, C.; Su, H. Zero123++: a Single Image to Consistent Multi-view Diffusion Base Model. CoRR 2023, abs/2310.15110, [2310.15110]. [CrossRef]
- Cheng, T.Y.; Gadelha, M.; Groueix, T.; Fisher, M.; Mech, R.; Markham, A.; Trigoni, N. Learning Continuous 3D Words for Text-to-Image Generation. In Proceedings of the CVPR. IEEE, 2024, pp. 6753–6762. [CrossRef]
- Burgess, J.; Wang, K.; Yeung, S. Viewpoint Textual Inversion: Unleashing Novel View Synthesis with Pretrained 2D Diffusion Models. CoRR 2023, abs/2309.07986, [2309.07986]. [CrossRef]
- Kumari, N.; Su, G.; Zhang, R.; Park, T.; Shechtman, E.; Zhu, J. Customizing Text-to-Image Diffusion with Camera Viewpoint Control. CoRR 2024, abs/2404.12333, [2404.12333]. [CrossRef]
- Deitke, M.; Schwenk, D.; Salvador, J.; Weihs, L.; Michel, O.; VanderBilt, E.; Schmidt, L.; Ehsani, K.; Kembhavi, A.; Farhadi, A. Objaverse: A Universe of Annotated 3D Objects. In Proceedings of the CVPR. IEEE, 2023, pp. 13142–13153. [CrossRef]
- Shen, X.; Elhoseiny, M. StoryGPT-V: Large Language Models as Consistent Story Visualizers. In Proceedings of the CVPR. Computer Vision Foundation / IEEE, 2025, pp. 13273–13283. [CrossRef]
- Wu, J.Z.; Ge, Y.; Wang, X.; Lei, S.W.; Gu, Y.; Shi, Y.; Hsu, W.; Shan, Y.; Qie, X.; Shou, M.Z. Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation. In Proceedings of the ICCV. IEEE, 2023, pp. 7589–7599. [CrossRef]
- Khachatryan, L.; Movsisyan, A.; Tadevosyan, V.; Henschel, R.; Wang, Z.; Navasardyan, S.; Shi, H. Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators. In Proceedings of the ICCV. IEEE, 2023, pp. 15908–15918. [CrossRef]
- Wang, J.; Yuan, H.; Chen, D.; Zhang, Y.; Wang, X.; Zhang, S. ModelScope Text-to-Video Technical Report. CoRR 2023, abs/2308.06571, [2308.06571]. [CrossRef]
- Singer, U.; Polyak, A.; Hayes, T.; Yin, X.; An, J.; Zhang, S.; Hu, Q.; Yang, H.; Ashual, O.; Gafni, O.; et al. Make-A-Video: Text-to-Video Generation without Text-Video Data. In Proceedings of the ICLR. OpenReview.net, 2023.
- Qi, T.; Yuan, J.; Feng, W.; Fang, S.; Liu, J.; Zhou, S.; He, Q.; Xie, H.; Zhang, Y. Mask2DiT: Dual Mask-based Diffusion Transformer for Multi-Scene Long Video Generation. CoRR 2025, abs/2503.19881, [2503.19881]. [CrossRef]
- Cai, M.; Cun, X.; Li, X.; Liu, W.; Zhang, Z.; Zhang, Y.; Shan, Y.; Yue, X. DiTCtrl: Exploring Attention Control in Multi-Modal Diffusion Transformer for Tuning-Free Multi-Prompt Longer Video Generation. In Proceedings of the CVPR. Computer Vision Foundation / IEEE, 2025, pp. 7763–7772. [CrossRef]
- Wu, J.; Li, X.; Zeng, Y.; Zhang, J.; Zhou, Q.; Li, Y.; Tong, Y.; Chen, K. MotionBooth: Motion-Aware Customized Text-to-Video Generation. In Proceedings of the NeurIPS; Globersons, A.; Mackey, L.; Belgrave, D.; Fan, A.; Paquet, U.; Tomczak, J.M.; Zhang, C., Eds., 2024.
- He, Y.; Xia, M.; Chen, H.; Cun, X.; Gong, Y.; Xing, J.; Zhang, Y.; Wang, X.; Weng, C.; Shan, Y.; et al. Animate-A-Story: Storytelling with Retrieval-Augmented Video Generation. CoRR 2023, abs/2307.06940, [2307.06940]. [CrossRef]
- Ding, H.; Liu, C.; He, S.; Jiang, X.; Loy, C.C. MeViS: A Large-scale Benchmark for Video Segmentation with Motion Expressions. In Proceedings of the ICCV. IEEE, 2023, pp. 2694–2703. [CrossRef]
- Zhou, X.; Koltun, V.; Krähenbühl, P. Tracking Objects as Points. In Proceedings of the ECCV; Vedaldi, A.; Bischof, H.; Brox, T.; Frahm, J., Eds. Springer, 2020, Lecture Notes in Computer Science, pp. 474–490. [CrossRef]
- Li, X.; Yuan, H.; Zhang, W.; Cheng, G.; Pang, J.; Loy, C.C. Tube-Link: A Flexible Cross Tube Baseline for Universal Video Segmentation. CoRR 2023, abs/2303.12782, [2303.12782]. [CrossRef]
- Ding, H.; Liu, C.; He, S.; Jiang, X.; Torr, P.H.S.; Bai, S. MOSE: A New Dataset for Video Object Segmentation in Complex Scenes. In Proceedings of the ICCV. IEEE, 2023, pp. 20167–20177. [CrossRef]
- Wu, X.; Hao, Y.; Sun, K.; Chen, Y.; Zhu, F.; Zhao, R.; Li, H. Human Preference Score v2: A Solid Benchmark for Evaluating Human Preferences of Text-to-Image Synthesis. CoRR 2023, abs/2306.09341, [2306.09341]. [CrossRef]
- Ma, Y.; Shui, Y.; Wu, X.; Sun, K.; Li, H. HPSv3: Towards Wide-Spectrum Human Preference Score. CoRR 2025, abs/2508.03789, [2508.03789]. [CrossRef]
- Li, J.; Feng, W.; Chen, W.; Wang, W.Y. Reward Guided Latent Consistency Distillation. Trans. Mach. Learn. Res. 2024, 2024.
- Luo, Y.; Hu, T.; Luo, W.; Tang, J. TDM-R1: Reinforcing Few-Step Diffusion Models with Non-Differentiable Reward. CoRR 2026, abs/2603.07700, [2603.07700]. [CrossRef]
- Sabour, A.; Fidler, S.; Kreis, K. Align Your Flow: Scaling Continuous-Time Flow Map Distillation. CoRR 2025, abs/2506.14603, [2506.14603]. [CrossRef]
- Guo, X.; Huo, J.; Shi, Z.; Song, Z.; Zhang, J.; Zhao, J. T2VPhysBench: A First-Principles Benchmark for Physical Consistency in Text-to-Video Generation. CoRR 2025, abs/2505.00337, [2505.00337]. [CrossRef]
- Wang, Z.; Wei, X.; Li, B.; Guo, Z.; Zhang, J.; Wei, H.; Wang, K.; Zhang, L. VideoVerse: How Far is Your T2V Generator from a World Model? CoRR 2025, abs/2510.08398, [2510.08398]. [CrossRef]
- Srivatsan, K.; Shamshad, F.; Naseer, M.; Nandakumar, K. STEREO: Towards Adversarially Robust Concept Erasing from Text-to-Image Generation Models. CoRR 2024, abs/2408.16807, [2408.16807]. [CrossRef]
- Lu, K.; Kriplani, N.; Gandikota, R.; Pham, M.; Bau, D.; Hegde, C.; Cohen, N. When Are Concepts Erased From Diffusion Models? CoRR 2025, abs/2505.17013, [2505.17013]. [CrossRef]
- Rusanovsky, M.; Malnick, S.; Jevnisek, A.; Fried, O.; Avidan, S. Memories of Forgotten Concepts. In Proceedings of the CVPR. Computer Vision Foundation / IEEE, 2025, pp. 2966–2975. [CrossRef]
- Lee, U.; Kim, J.; Hwang, S. Localized Concept Erasure in Text-to-Image Diffusion Models via High-Level Representation Misdirection. CoRR 2026, abs/2602.19631, [2602.19631]. [CrossRef]
- Lee, B.H.; Lim, S.; Lee, S.; Kang, D.U.; Chun, S.Y. Concept Pinpoint Eraser for Text-to-image Diffusion Models via Residual Attention Gate. In Proceedings of the ICLR. OpenReview.net, 2025.
- Kim, S.; Jung, S.; Kim, B.; Choi, M.; Shin, J.; Lee, J. Safeguard Text-to-Image Diffusion Models with Human Feedback Inversion. In Proceedings of the ECCV; Leonardis, A.; Ricci, E.; Roth, S.; Russakovsky, O.; Sattler, T.; Varol, G., Eds. Springer, 2024, Lecture Notes in Computer Science, pp. 128–145. [CrossRef]
- Bakr, E.M.; Sun, P.; Shen, X.; Khan, F.F.; Li, L.E.; Elhoseiny, M. HRS-Bench: Holistic, Reliable and Scalable Benchmark for Text-to-Image Models. In Proceedings of the ICCV. IEEE, 2023, pp. 19984–19996. [CrossRef]
- Hu, X.; Wang, R.; Fang, Y.; Fu, B.; Cheng, P.; Yu, G. ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment. CoRR 2024, abs/2403.05135, [2403.05135]. [CrossRef]
- Lin, Z.; Pathak, D.; Li, B.; Li, J.; Xia, X.; Neubig, G.; Zhang, P.; Ramanan, D. Evaluating Text-to-Visual Generation with Image-to-Text Generation. In Proceedings of the ECCV; Leonardis, A.; Ricci, E.; Roth, S.; Russakovsky, O.; Sattler, T.; Varol, G., Eds. Springer, 2024, Lecture Notes in Computer Science, pp. 366–384. [CrossRef]
- Wang, S.; Saharia, C.; Montgomery, C.; Pont-Tuset, J.; Noy, S.; Pellegrini, S.; Onoe, Y.; Laszlo, S.; Fleet, D.J.; Soricut, R.; et al. Imagen Editor and EditBench: Advancing and Evaluating Text-Guided Image Inpainting. In Proceedings of the CVPR. IEEE, 2023, pp. 18359–18369. [CrossRef]
- Zhang, K.; Mo, L.; Chen, W.; Sun, H.; Su, Y. MagicBrush: A Manually Annotated Dataset for Instruction-Guided Image Editing. In Proceedings of the NeurIPS; Oh, A.; Naumann, T.; Globerson, A.; Saenko, K.; Hardt, M.; Levine, S., Eds., 2023.
- Huang, Z.; He, Y.; Yu, J.; Zhang, F.; Si, C.; Jiang, Y.; Zhang, Y.; Wu, T.; Jin, Q.; Chanpaisit, N.; et al. VBench: Comprehensive Benchmark Suite for Video Generative Models. In Proceedings of the CVPR. IEEE, 2024, pp. 21807–21818. [CrossRef]
- Han, H.; Li, S.; Chen, J.; Yuan, Y.; Wu, Y.; Deng, Y.; Leong, C.T.; Du, H.; Fu, J.; Li, Y.; et al. Video-Bench: Human-Aligned Video Generation Benchmark. In Proceedings of the CVPR. Computer Vision Foundation / IEEE, 2025, pp. 18858–18868. [CrossRef]
- Liu, Y.; Cun, X.; Liu, X.; Wang, X.; Zhang, Y.; Chen, H.; Liu, Y.; Zeng, T.; Chan, R.; Shan, Y. EvalCrafter: Benchmarking and Evaluating Large Video Generation Models. In Proceedings of the CVPR. IEEE, 2024, pp. 22139–22149. [CrossRef]
- Liu, Y.; Li, L.; Ren, S.; Gao, R.; Li, S.; Chen, S.; Sun, X.; Hou, L. FETV: A Benchmark for Fine-Grained Evaluation of Open-Domain Text-to-Video Generation. In Proceedings of the NeurIPS; Oh, A.; Naumann, T.; Globerson, A.; Saenko, K.; Hardt, M.; Levine, S., Eds., 2023.
- Zhuang, C.; Huang, A.; Cheng, W.; Wu, J.; Hu, Y.; Liao, J.; Huang, Z.; Wang, H.; Liao, X.; Cai, W.; et al. ViStoryBench: Comprehensive Benchmark Suite for Story Visualization. CoRR 2025, abs/2505.24862, [2505.24862]. [CrossRef]
- Dave, A.; Khurana, T.; Tokmakov, P.; Schmid, C.; Ramanan, D. TAO: A Large-Scale Benchmark for Tracking Any Object. In Proceedings of the ECCV; Vedaldi, A.; Bischof, H.; Brox, T.; Frahm, J., Eds. Springer, 2020, Lecture Notes in Computer Science, pp. 436–454. [CrossRef]
- Miao, J.; Wei, Y.; Wu, Y.; Liang, C.; Li, G.; Yang, Y. VSPW: A Large-scale Dataset for Video Scene Parsing in the Wild. In Proceedings of the CVPR. Computer Vision Foundation / IEEE, 2021, pp. 4133–4143. [CrossRef]
- Caesar, H.; Bankiti, V.; Lang, A.H.; Vora, S.; Liong, V.E.; Xu, Q.; Krishnan, A.; Pan, Y.; Baldan, G.; Beijbom, O. nuScenes: A Multimodal Dataset for Autonomous Driving. In Proceedings of the CVPR. Computer Vision Foundation / IEEE, 2020, pp. 11618–11628. [CrossRef]
- Chen, Y.; Zhu, X.; Li, T.; Chen, H.; Shen, C. A Physical Coherence Benchmark for Evaluating Video Generation Models via Optical Flow-guided Frame Prediction. CoRR 2025, abs/2502.05503, [2502.05503]. [CrossRef]
- Mariam, K.M.M.; Arun, A.; Laskar, Z.; Jawahar, C.V. PhyEduVideo: A Benchmark for Evaluating Text-to-Video Models for Physics Education. CoRR 2026, abs/2601.00943, [2601.00943]. [CrossRef]
- Montanaro, A.; Aira, L.S.; Aiello, E.; Valsesia, D.; Magli, E. MotionCraft: Physics-Based Zero-Shot Video Generation. In Proceedings of the NeurIPS; Globersons, A.; Mackey, L.; Belgrave, D.; Fan, A.; Paquet, U.; Tomczak, J.M.; Zhang, C., Eds., 2024.
- Samuel, D.; Tzachor, I.; Levy, M.; Green, M.; Chechik, G.; Ben-Ari, R. Fast Autoregressive Video Diffusion and World Models with Temporal Cache Compression and Sparse Attention. CoRR 2026, abs/2602.01801, [2602.01801]. [CrossRef]
- Gandikota, R.; Materzynska, J.; Fiotto-Kaufman, J.; Bau, D. Erasing Concepts from Diffusion Models. In Proceedings of the ICCV. IEEE, 2023, pp. 2426–2436. [CrossRef]
- Kumari, N.; Zhang, B.; Wang, S.; Shechtman, E.; Zhang, R.; Zhu, J. Ablating Concepts in Text-to-Image Diffusion Models. In Proceedings of the ICCV. IEEE, 2023, pp. 22634–22645. [CrossRef]
- Geyer, M.; Bar-Tal, O.; Bagon, S.; Dekel, T. TokenFlow: Consistent Diffusion Features for Consistent Video Editing. In Proceedings of the ICLR. OpenReview.net, 2024.
- Avrahami, O.; Hertz, A.; Vinker, Y.; Arar, M.; Fruchter, S.; Fried, O.; Cohen-Or, D.; Lischinski, D. The Chosen One: Consistent Characters in Text-to-Image Diffusion Models. In Proceedings of the SIGGRAPH; Burbano, A.; Zorin, D.; Jarosz, W., Eds. ACM, 2024, p. 26. [CrossRef]
- Chen, H.; Zhang, Y.; Cun, X.; Xia, M.; Wang, X.; Weng, C.; Shan, Y. VideoCrafter2: Overcoming Data Limitations for High-Quality Video Diffusion Models. In Proceedings of the CVPR. IEEE, 2024, pp. 7310–7320. [CrossRef]
- Yang, S.; Du, Y.; Ghasemipour, S.K.S.; Tompson, J.; Kaelbling, L.P.; Schuurmans, D.; Abbeel, P. Learning Interactive Real-World Simulators. In Proceedings of the ICLR. OpenReview.net, 2024.

| Survey scope | Refs. | Focus | Ext. | Int. | Norm. | Cross-rel. |
|---|---|---|---|---|---|---|
| Text-to-image / Controllable generation | [16,17,24] | Prompt following, conditioning, and preference-related evaluation | ✓ | – | Partial | Limited |
| Editing / Personalization | [8,9,25,26] | Edit preservation, subject binding, and user-adaptive generation | ✓ | ✓ | Partial | Limited |
| Video / Long-form generation | [18,19,27] | Temporal coherence, video synthesis, and narrative continuity | Partial | ✓ | Partial | Partial |
| Alignment / Safety | [20,28,29] | Preference, safety, robustness, and trustworthy generation | Partial | – | ✓ | Partial |
| 3D / 4D / Physical generation | [21,22,23,30] | Geometry, dynamics, physical plausibility, and embodied world modeling | Partial | ✓ | ✓ | Partial |
| This survey | – | Agreement relations, enforcement loci, evaluation protocols, and trade-offs | ✓ | ✓ | ✓ | ✓ |

| Family | Target | Locus | Failure tag | Anchor methods |
|---|---|---|---|---|
| Attention repair | Prompt / comp. | Attention | Omission / grounding | Attend-and-Excite [31], SEGA [113] |
| Spatial guidance | Layout / count | Sampling | Relation / count | BoxDiff [58] |
| Control adapters | Structure | Condition path | Condition mismatch | ControlNet [32], T2I-Adapter [63] |
| Grounding modules | Region / reference | Grounding | Anchoring / ref. drift | GLIGEN [48], IP-Adapter [64] |
| Editing | Instruction / mask | Edit path | Over-editing / drift | DiffEdit [69], InstructPix2Pix [33], InstructDiffusion [70] |

| Subproblem | State | Coupling route | Drift tag | Anchor methods |
|---|---|---|---|---|
| Subject identity | Subject ID | Adaptation / ID feature | Face / clothing | DreamBooth [34], PhotoMaker [52], InstantID [76] |
| Story identity | Character role | Cross-image propagation | Character drift | StoryDiffusion [47], ConsiStory [77] |
| Multi-view / 3D | Geometry | View coupling | View / geometry | Zero-1-to-3 [78], SyncDreamer [80], MVDream [35] |
| Text-to-video | Motion / appearance | Temporal module | Flicker / motion | AnimateDiff [36] |
| Video editing | Edited structure | Frame coupling | Shape wobble | StableV2V [129] |
| Narrative | Entity / event state | Planning / memory | Forgetting / contradiction | TaleCrafter [87], MovieDreamer [89] |

| Type | Target | Evidence | Risk / failure | Anchor methods / resources |
|---|---|---|---|---|
| Human-centered | Preference / aesthetics | Human reward | Reward / diversity | ImageReward [53], HPSv3 [166], Diffusion-DPO [37] |
| Human-centered | Safety / values | Unsafe set + retention | Over-refusal / amnesia | SLD [38], UCE [97], Six-CD [101] |
| World-centered | Physical plausibility | Physics test | Physical violation | PhyBench [102], VideoPhy-2 [103], T2VPhysBench [170] |
| World-centered | Commonsense / causality | World-state test | Scene / causal break | T2VWorldBench [104], PhyWorldBench [105], VideoVerse [171] |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
