Consistency in Diffusion-Based Visual Generation: A Survey

Song Yan; Wei Zhai; Chenfeng Wang; Ruixuan Li; Zhangping Yang; Yancheng Cai; Tao Zhang; Ling Wang; Yunwei Lan; Yujie He; Yang Cao; Min Li; Zheng-Jun Zha

doi:10.20944/preprints202606.0870.v1

Submitted: 10 June 2026

Posted: 11 June 2026

Abstract

Diffusion models are now a standard foundation for visual generation, including image synthesis, editing, video generation, and 3D-aware content creation. As these models are used for more structured creation, visual quality alone is often insufficient. Generated content can omit prompt details, break compositional relations, lose subject identity across instances, drift over time, disagree across viewpoints, or violate preference, safety, and physical-plausibility requirements. These problems are studied in many task-specific literatures, but they are seldom organized by the agreement relation that a method is designed to enforce. We treat consistency as a requirement on generated content and formulate it as a family of agreement relations. We distinguish three primary relations: external consistency, which concerns agreement with user-specified or task-provided conditions; internal consistency, which concerns coherence among generated parts, views, instances, or temporal states; and normative consistency, which concerns agreement with evaluative criteria such as human preference, safety, and world-level plausibility. We review representative methods under this relation-based view, comparing their scope, preserved properties, and enforcement mechanisms. We also summarize evaluation protocols and benchmarks, examine cross-category trade-offs, and discuss open problems in conflict-aware optimization, interpretable metrics, persistent memory, and long-horizon world-consistent generation. The survey links failure modes to mechanism families, evaluation choices, and unresolved trade-offs.

Keywords: diffusion models; consistency; controllable generation; image and video generation; generative evaluation

Subject: Computer Science and Mathematics - Artificial Intelligence and Machine Learning

1. Introduction

Diffusion models are now widely used for visual content generation, from text-to-image synthesis and image editing to personalization, video generation, and 3D-aware content creation. Their appeal comes from strong generative priors, scalable training, and flexible conditioning mechanisms, which together support high visual fidelity, diversity, and control. Representative developments include denoising diffusion models [1,2], score-based generative models [3], latent diffusion [4], and large-scale text-to-image systems [5,6,7]. The same modeling paradigm has been extended to editing [8], personalization [9], video generation [10,11], 3D-aware synthesis [12,13], and world-modeling-oriented generation [14,15]. These extensions improve fidelity and controllability, but they also make a familiar limitation more visible: a visually plausible output need not be a consistent one.

We focus on that limitation, which we frame as a problem of consistency. In structured generation, the user often expects more than a visually plausible sample. The output may need to follow a detailed prompt, preserve a subject identity, respect a layout or reference image, remain stable across video frames, agree across viewpoints, and avoid unsafe or physically implausible behavior. These requirements are straightforward to state but difficult to satisfy jointly and reliably.

Most existing surveys are organized by task or modality, including text-to-image generation [16], controllable generation [17], image editing [8], personalization [9], video diffusion [18], long-form storytelling [19], alignment [20], 3D generation [21], 4D generation [22], and physical generative modeling [23]. That organization is useful, but it can hide problems with the same underlying structure. Prompt faithfulness, identity preservation, temporal coherence, multi-view agreement, and safety alignment use different terminology and are often evaluated with incompatible protocols. We ask a more basic question: what is the target of agreement in diffusion-based generation, and how should methods, benchmarks, and trade-offs be organized once that target is made explicit?

Table 1 positions this survey relative to closely related work. Prior surveys cover particular tasks, modalities, or alignment problems in depth. Here, we organize the discussion around the agreement relation being enforced, so that method families, evaluation protocols, failure modes, and trade-offs can be compared across task boundaries. The supplementary material details the literature-collection procedure and inclusion/exclusion criteria.

We use consistency to mean a family of agreement requirements, not a single score. A generated output may need to match external conditions, remain coherent across parts, views, or temporal states, and satisfy evaluative criteria such as preference, safety, or physical plausibility. We organize the literature around three relations: external consistency, internal consistency, and normative consistency, comparing methods by the agreement target they preserve.

Organizing by agreement relation separates task labels from what is being optimized. Attention repair, structural control, and editing bind outside conditions to generation [31,32,33]; personalization, multi-view generation, and temporal modules stabilize shared state [34,35,36]; and preference optimization, safe generation, and physical evaluation judge outputs against broader standards [37,38,39]. The contribution is therefore not another task catalogue, but a framework that connects mechanisms, metrics, benchmarks, failure modes, and trade-offs through the agreement target each method preserves.

The taxonomy has one primary relation axis, four auxiliary axes, and secondary-relation annotations. The auxiliary axes specify observation unit, agreement target, optimization locus, and evidence source. Secondary relations are recorded only when overlap changes mechanism design, evaluation, or trade-off analysis. These elements distinguish prompt faithfulness from compositionality, identity persistence from multi-view coherence, and human-centered preference consistency from world-centered plausibility, while keeping cross-relation conflicts visible.

Evaluation recurs throughout the survey. We map consistency targets to protocols and benchmarks, discuss metric blind spots, and examine tensions such as faithfulness versus aesthetics, identity consistency versus editability, and temporal stability versus motion richness.

The main contributions are as follows:

We define consistency in diffusion-based generation as an overlap-aware family of agreement requirements, rather than a task-specific label, visual-quality proxy, or single metric.
We organize the literature into external, internal, and normative consistency, compare methods through four auxiliary axes, and use secondary-relation annotations to mark overlap.
We connect major consistency targets with mechanisms, evaluation protocols, benchmarks, failure modes, and known metric blind spots.
We identify trade-offs when multiple consistency requirements must be satisfied jointly, including conflicts among faithfulness, identity preservation, safety, preference, temporal stability, and physical plausibility.

The rest of the survey is organized as follows. Section 2 formulates consistency as a generative requirement and introduces the three primary relations used throughout the paper. It also defines two cross-cutting views, evaluation protocol and optimization locus, that recur in later sections. Section 3, Section 4 and Section 5 review external, internal, and normative consistency, respectively, covering representative methods, evaluation protocols, and application settings. Section 6 discusses cross-relation trade-offs and open problems. Section 7 concludes the survey.

2. Background and Problem Formulation

2.1. Consistency as a Generative Requirement

Here, consistency denotes a requirement that generated content remain compatible with the constraints or evaluative criteria that matter for the task. The constraints may involve semantics, structure, identity, viewpoint, time, safety, preference, or physical plausibility. This usage is broader than, and distinct from, the specific family of Consistency Models [40] for fast or one-step sampling: in this survey, consistency is not a model class, sampler, or acceleration paradigm. It is a property of generated content, motivated by recurring requirements in controllable generation and alignment-oriented diffusion systems [17,20].

The definition covers several recurring requirements. In text-to-image synthesis, consistency appears as faithfulness to requested objects, attributes, counts, and relations [41,42]. In editing, it requires executing an instruction while preserving content that should not change [33,43]. In personalization, it appears as subject persistence across contexts [34,44]. In video and multi-view generation, it becomes cross-frame or cross-view coherence [35,45]. Beyond prompt matching, evaluative criteria also arise: an output may look realistic but still be unsafe, undesirable, or physically implausible [37,39]. Figure 1 contrasts these agreement targets and their typical failure modes.

2.2. Three Primary Consistency Relations

We distinguish three primary consistency relations: external consistency, internal consistency, and normative consistency. The distinction is based on the source of the agreement target rather than on task names or modalities. External consistency compares an output with user- or task-provided conditions; internal consistency compares related generated states with one another; normative consistency compares outputs with evaluative criteria such as preference, safety, or world plausibility. The separation reflects a practical fact: prompt faithfulness, structure control, identity persistence, temporal coherence, safety, preference, and physical plausibility are usually evaluated and optimized through different evidence sources and intervention loci [37,41].

The three relations are not mutually exclusive task categories. They identify the primary source of agreement, while practical generation settings often involve several relations at once. Image editing, for example, requires external consistency with an instruction and internal consistency with source content that should be preserved [33,43]; story generation requires external consistency with narrative prompts and internal consistency across characters, objects, and events [46,47]; safe personalized generation may require internal identity persistence, external prompt following, and normative safety constraints simultaneously [34,38]. We use the three relations as the main organizing axis and mark secondary relations when they affect mechanism design, evaluation, or trade-offs.

External consistency concerns agreement between generated output and externally specified conditions. These conditions may include text prompts, layouts, masks, poses, depth maps, reference images, editing instructions, or multiple simultaneous controls [32,48]. From this perspective, prompt faithfulness, compositional correctness, grounded generation, and instruction-based editing are special cases of the same requirement [33,41]. The defining property is that the agreement target lies outside the generated content, whether it comes from the user, the task definition, or an auxiliary control source. External consistency evaluates whether the output satisfies that condition.

Internal consistency concerns compatibility among different parts, views, instances, or temporal states of the generated content. It primarily includes identity preservation across images, geometric and appearance consistency across viewpoints, temporal coherence across video frames, and long-range narrative continuity [34,47]. At the boundary, it can also include intra-image self-compatibility when local structure, support, occlusion, or illumination must remain mutually plausible. The target is no longer only an external instruction; generated outputs must remain mutually compatible even when separated across space, viewpoint, or time. Recent personalization and video-generation methods also show that internal consistency can compete with editability, prompt alignment, and motion diversity [44,49].

Normative consistency concerns agreement with evaluative criteria not fully specified by the instance prompt and expected to hold by default, such as human preference, safety requirements, physical plausibility, commonsense expectations, or policy restrictions. We use “normative” operationally rather than morally: the reference is a standing evaluative standard outside both the literal prompt and the generated sample, not necessarily a human value judgment. Thus, a criterion is external when explicitly requested as part of a prompt or control input, but normative when it governs acceptable generation even without being restated. Preference alignment and safe generation are human-centered cases, while physical and causal plausibility are world-centered cases [37,39].

Beyond the primary relation axis, four auxiliary axes make the taxonomy operational: observation unit, such as a single image, edited pair, identity-conditioned set, multi-view bundle, or video/story sequence; agreement target, such as prompt semantics, layout, reference appearance, identity, geometry, temporal state, safety, preference, or physical plausibility; optimization locus, including data/objective design, condition-path design, denoising-time intervention, cross-instance coupling, or post-hoc verification; and evidence source, including VQA or MLLM checks, similarity metrics, geometric signals, temporal diagnostics, reward models, safety classifiers, human judgments, and stress tests [50,51].

The relations are conceptually distinct, but often meet in one system. A personalized editor may combine identity preservation, instruction following, and aesthetic preference; a long-form video system may combine prompt faithfulness, character persistence, and physical plausibility [39,47]. We assign each method to its dominant agreement requirement and add secondary-relation annotations only when overlap affects mechanisms, evaluation, or trade-offs. Thus, editing remains external when instruction execution is primary, but collateral identity drift is an internal secondary relation; personalization remains internal when identity state is primary, but safety filtering adds a normative secondary relation; and world-grounded video evaluation remains normative when plausibility is the target, while temporal state tracking adds an internal secondary relation.
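The annotation scheme described above can be written down as a small record type. The following is a minimal sketch, not the survey's own tooling: the class and field names are hypothetical, and the axis values filled in for the three entries are illustrative choices that follow the examples in the preceding paragraph.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Set, Tuple


class Relation(Enum):
    EXTERNAL = "external"
    INTERNAL = "internal"
    NORMATIVE = "normative"


@dataclass(frozen=True)
class MethodAnnotation:
    """One taxonomy entry: a primary relation, four auxiliary axes, optional secondary relations."""
    name: str
    primary: Relation
    observation_unit: str
    agreement_target: str
    optimization_locus: str
    evidence_source: str
    secondary: Tuple[Relation, ...] = ()

    def __post_init__(self) -> None:
        # A secondary annotation marks overlap, so it must differ from the primary relation.
        if self.primary in self.secondary:
            raise ValueError("primary relation must not be repeated as secondary")


def relations_involved(entry: MethodAnnotation) -> Set[Relation]:
    """All relations an entry touches, primary first in importance but returned as a set."""
    return {entry.primary, *entry.secondary}


# Hypothetical entries; names and axis values are illustrative assumptions.
editor = MethodAnnotation(
    name="instruction editor",
    primary=Relation.EXTERNAL,
    observation_unit="edited pair",
    agreement_target="editing instruction",
    optimization_locus="denoising-time intervention",
    evidence_source="VQA or MLLM checks",
    secondary=(Relation.INTERNAL,),  # collateral identity drift
)
personalizer = MethodAnnotation(
    name="subject personalizer",
    primary=Relation.INTERNAL,
    observation_unit="identity-conditioned set",
    agreement_target="identity",
    optimization_locus="data/objective design",
    evidence_source="similarity metrics",
    secondary=(Relation.NORMATIVE,),  # safety filtering
)
world_video_eval = MethodAnnotation(
    name="world-grounded video evaluation",
    primary=Relation.NORMATIVE,
    observation_unit="video sequence",
    agreement_target="physical plausibility",
    optimization_locus="post-hoc verification",
    evidence_source="stress tests",
    secondary=(Relation.INTERNAL,),  # temporal state tracking
)
```

Keeping the primary relation as a single field and the secondary relations as an optional tuple reflects the rule that each method is assigned to one dominant requirement, with overlap recorded only when it matters.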

2.3. How Consistency Can Be Evaluated

A consistency claim is meaningful only if its evaluation protocol is clear. The protocol must define what is observed, what property is tested, what evidence supports the judgment, and what evaluation output is produced. We describe this protocol through four components: observation unit, agreement target, source of evidence, and evaluation output. Figure 2 summarizes this view and shows why evaluation must match the tested consistency relation [41,50].

Observation unit. Consistency is not always evaluated on the same object. Some protocols assess a single generated image against its prompt [41,50]. Others assess a before/after editing pair, asking whether the requested change was realized while non-edited content remained stable [33,43]. Identity-preservation protocols operate on a set of identity-conditioned images [34,52]; multi-view protocols operate on a bundle of related views [35,51]; temporal and story-oriented settings operate on a sequence of frames or panels [45,47]. The distinction matters because many failures are invisible at one unit of observation and only emerge when outputs are compared across views, frames, or identity-linked samples.

Agreement target. Once the observation unit is fixed, evaluation must specify what form of agreement is being checked. For external consistency, the target is usually output-to-condition agreement: prompt faithfulness, compositional correctness, structure adherence, or edit preservation [32,41]. For internal consistency, it is cross-instance compatibility: identity persistence, multi-view compatibility, temporal continuity, or narrative state stability [34,51]. For normative consistency, it is agreement with an evaluative principle such as human preference, policy compliance, physical plausibility, or commonsense validity [39,53]. Thus, two protocols may operate on the same image set yet measure different properties because they ask different agreement questions.

Source of evidence. The same agreement target can be probed through different evidence sources. One family uses MLLM- or VQA-based verification, which is natural when the claim can be expressed as questions about objects, attributes, relations, counts, or edit outcomes [41,50]. A second family uses similarity and matching signals, including identity similarity, retrieval-style evidence, or cross-view feature agreement [51,52]. A third family uses set- or sequence-level checks, where temporal smoothness, reconstruction agreement, state continuity, or long-range coherence become relevant [45,47]. A fourth family uses learned evaluators or reward models for preference, aesthetics, or safety scoring [37,53]. Some claims still require human judgment or stress tests, especially when success is difficult to reduce to a single automated proxy [39,54].

Evaluation output. Different protocol choices produce different outputs. Some protocols yield semantic correctness signals, such as whether requested objects, attributes, relations, or counts are realized [41,42]. Others yield edit-success and preservation judgments [33,43]. Identity-focused and set-based protocols yield persistence or compatibility scores across images, views, or frames [34,51]. Preference- and safety-oriented protocols yield normative judgments, such as preference alignment, safety compliance, or world plausibility [38,53]. Consistency evaluation should not be reduced to one master metric; it consists of output classes corresponding to different agreement claims.
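As an illustration of how the four protocol components fit together, the sketch below fixes the observation unit to a single image, the agreement target to prompt faithfulness, and the evidence source to question answering, and it returns per-category correctness signals as the evaluation output. It is a toy under stated assumptions: `verify_by_questions` and `toy_answerer` are hypothetical names, and the answerer looks answers up in a hand-written scene description where a real protocol would query a VQA model or MLLM.

```python
from typing import Callable, Dict, List, Tuple

# (category, question, expected answer); categories follow the text:
# object, attribute, relation, count.
Question = Tuple[str, str, str]


def verify_by_questions(
    image: object,
    questions: List[Question],
    answer_fn: Callable[[object, str], str],
) -> Dict[str, float]:
    """Return per-category accuracy instead of one holistic score."""
    hits: Dict[str, int] = {}
    totals: Dict[str, int] = {}
    for category, question, expected in questions:
        predicted = answer_fn(image, question).strip().lower()
        totals[category] = totals.get(category, 0) + 1
        hits[category] = hits.get(category, 0) + int(predicted == expected.lower())
    return {category: hits[category] / totals[category] for category in totals}


def toy_answerer(scene: Dict[str, str], question: str) -> str:
    """Hypothetical stand-in for a VQA/MLLM answerer; it never inspects pixels."""
    return scene.get(question, "unknown")


# A generated "image" summarized by what a perfect answerer would say about it.
scene = {
    "Is there a dog?": "yes",
    "What color is the ball?": "blue",  # the prompt asked for a red ball
    "How many dogs are there?": "2",
}
questions = [
    ("object", "Is there a dog?", "yes"),
    ("attribute", "What color is the ball?", "red"),
    ("count", "How many dogs are there?", "2"),
    ("relation", "Is the dog left of the ball?", "yes"),
]
report = verify_by_questions(scene, questions, toy_answerer)
```

In this toy case the object and count categories pass while the attribute and relation categories fail, which is the kind of separation a single similarity score would hide.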

No single metric covers the whole space. A prompt-faithfulness protocol may reveal little about identity persistence. A metric sensitive to multi-view or temporal incompatibility may still miss unsafe content or preference misalignment. Conversely, an aesthetic reward model may say little about edit preservation or causal plausibility in generated video. Consistency evaluation is a coverage problem: each protocol observes one slice of the agreement space and should be interpreted by the slice it makes visible.
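The coverage view can be made concrete with a small bookkeeping sketch. Everything here is hypothetical: the protocol names and the targets assigned to each are illustrative assumptions, not a summary of any real benchmark.

```python
from typing import Dict, Iterable, Set

# Agreement targets a deployment might care about (illustrative assumption).
REQUIRED_TARGETS: Set[str] = {
    "prompt faithfulness",
    "identity persistence",
    "temporal continuity",
    "safety",
    "preference",
}

# Which slice of the agreement space each hypothetical protocol observes.
PROTOCOL_COVERAGE: Dict[str, Set[str]] = {
    "vqa_prompt_check": {"prompt faithfulness"},
    "identity_similarity": {"identity persistence"},
    "frame_continuity_check": {"temporal continuity"},
    "aesthetic_reward": {"preference"},
    "safety_classifier": {"safety"},
}


def uncovered_targets(required: Set[str], selected: Iterable[str]) -> Set[str]:
    """Targets that no selected protocol observes."""
    covered: Set[str] = set()
    for protocol in selected:
        covered |= PROTOCOL_COVERAGE[protocol]
    return required - covered


blind_spots = uncovered_targets(REQUIRED_TARGETS, ["vqa_prompt_check", "aesthetic_reward"])
```

With only a prompt check and an aesthetic reward selected, identity persistence, temporal continuity, and safety remain unobserved, so a good score on the selected protocols says nothing about them.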

Figure 2 explains why evaluation remains fragmented. A benchmark may cover only one slice of the agreement space; single-sample realism can hide failures that appear only across views, frames, or generated sets; and even purpose-built suites under-sample the prompts, identities, motions, and world constraints that real systems must handle [39,41,42,47,51,54]. These limits motivate the benchmark-coverage analysis developed later in the survey.

2.4. Where Consistency Can Be Optimized

After the agreement target is defined, the next question is where consistency can be enforced in a diffusion system. In practice, consistency is not optimized by a universal module. It can be shaped before sampling, at the condition interface, during denoising, across coupled outputs, or after generation. Figure 3 summarizes this process-centered view. The relation view specifies what consistency is required; the locus view specifies where the generation pipeline is modified to encourage it. We use this view later to compare methods within each relation.

Data / objective optimization. Data- and objective-level methods improve consistency before generation begins by changing the training distribution, supervision signal, or optimization objective. Examples include subject-specific adaptation, preference alignment, persistent safety editing, and structurally supervised finetuning [34,37]. These methods write consistency into model parameters, making the behavior more persistent across generations. The cost is additional training, narrower specialization, or greater risk of collateral degradation. Figure 3 denotes this locus as consistency imposed through data and objective design.

Condition-path optimization. Condition-path methods operate where external conditions enter the model. They determine how prompts, layouts, poses, depth maps, masks, and reference images are encoded, fused, routed, or injected so that they become effective denoising-time commitments [32,55]. This locus is natural for grounded and structural external consistency, where the question is how the control signal is represented and coupled to the denoiser. The locus is defined by where the intervention enters the pipeline, not by whether the module is trained: trainable side branches such as ControlNet and lightweight inference-time adapters can both belong here. Stronger condition injection improves controllability, but may increase condition conflict, reduce flexibility, or over-constrain the output.

Trajectory-time optimization. Denoising-time methods modify the reverse process through guidance, attention intervention, semantic steering, or constraint-based correction [31,43]. Such interventions support prompt repair, local semantic emphasis, layout correction, and some forms of safety control. They treat consistency as an online correction problem: when the denoising path drifts from the target agreement relation, the trajectory can be redirected before final sampling. Stronger guidance may improve the chosen form of consistency, but it can reduce realism, diversity, or sampling efficiency. Recent trajectory- and timestep-aware calibration methods further suggest that consistency can be shaped by exploiting cross-timestep structure rather than final-output supervision alone [56].
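The online-correction idea can be sketched as a sampling loop in which every reverse step is followed by a gradient step on an agreement loss. This is a toy under stated assumptions: `reverse_step` is a hypothetical stand-in that merely shrinks the latent, the agreement loss is a squared distance to a fixed target vector, and no cited method is reproduced.

```python
from typing import List


def reverse_step(x: List[float], t: int) -> List[float]:
    """Stand-in for one denoising update; a real sampler would call the denoiser here."""
    return [0.9 * v for v in x]


def agreement_loss(x: List[float], target: List[float]) -> float:
    """Toy agreement objective: squared distance between the latent and a target."""
    return sum((a - b) ** 2 for a, b in zip(x, target))


def agreement_grad(x: List[float], target: List[float]) -> List[float]:
    """Analytic gradient of agreement_loss with respect to x."""
    return [2.0 * (a - b) for a, b in zip(x, target)]


def sample(
    x: List[float],
    target: List[float],
    steps: int = 10,
    guidance_scale: float = 0.0,
) -> List[float]:
    """Run the toy reverse process, redirecting the latent after every step."""
    for t in reversed(range(steps)):
        x = reverse_step(x, t)
        grad = agreement_grad(x, target)
        x = [v - guidance_scale * g for v, g in zip(x, grad)]
    return x


start = [1.0, -1.0]
target = [0.5, 0.5]
unguided = sample(start, target, guidance_scale=0.0)
guided = sample(start, target, guidance_scale=0.2)
```

With `guidance_scale=0.0` the loop reduces to the plain reverse process. Raising the scale pulls the final latent toward the target, and in a real system that same pull is what can cost realism or diversity.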

Cross-instance optimization. Cross-instance methods become necessary once generation is no longer single-sample. If the target is identity persistence, multi-view compatibility, temporal continuity, or narrative coherence, the system often has to couple related outputs rather than generate them independently. In this locus, views, frames, identity-linked images, or story panels share latent state, features, memory, or synchronized denoising pathways [35,47]. Consistency then becomes a property of the joint generation process. Tighter coupling can improve internal compatibility, but it usually increases memory cost, state-management complexity, or constraints on local edits.
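A minimal way to picture cross-instance coupling is to blend each frame's latent with a state shared by the whole group. The sketch below is a crude stand-in, assuming the shared state is simply the group mean; the methods discussed above share attention features, memory, or synchronized denoising pathways, which this toy does not model. All names are hypothetical.

```python
from typing import List

Latent = List[float]


def shared_state(latents: List[Latent]) -> Latent:
    """Group mean, used here as a toy shared state across frames or views."""
    count = len(latents)
    return [sum(latent[i] for latent in latents) / count for i in range(len(latents[0]))]


def coupled_step(latents: List[Latent], coupling: float) -> List[Latent]:
    """Pull every latent toward the shared state; coupling=0 keeps them independent."""
    shared = shared_state(latents)
    return [
        [(1.0 - coupling) * value + coupling * s for value, s in zip(latent, shared)]
        for latent in latents
    ]


def spread(latents: List[Latent]) -> float:
    """Mean squared deviation from the shared state, a toy inconsistency measure."""
    shared = shared_state(latents)
    total = sum((v - s) ** 2 for latent in latents for v, s in zip(latent, shared))
    return total / len(latents)


frames = [[0.0, 1.0], [1.0, 0.0], [2.0, 2.0]]
independent = coupled_step(frames, coupling=0.0)
coupled = coupled_step(frames, coupling=0.5)
collapsed = coupled_step(frames, coupling=1.0)
```

The `collapsed` case shows the trade-off in its extreme form: full coupling removes all disagreement between frames, but it also removes any variation between them.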

Post-hoc optimization. Post-hoc methods leave the generator unchanged and let evaluators, rerankers, safety filters, or corrective modules decide which outputs to retain, reject, revise, or surface to users [39,53]. This locus is common when the target principle is difficult to encode into the denoiser, as in preference alignment, safety screening, or world-grounded plausibility checking. Post-hoc optimization is modular and straightforward to integrate, but it often treats symptoms rather than causes: it can improve output quality as perceived by users without necessarily changing the generator’s tendency to produce inconsistent outputs.
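A filter-then-rerank pipeline is one simple instance of this locus. The sketch below assumes each candidate already carries a safety score and a preference score; the field names, the threshold, and the scores themselves are hypothetical, and in practice they would come from a safety classifier and a reward model.

```python
from typing import Callable, Dict, List

Candidate = Dict[str, float]


def select_outputs(
    candidates: List[Candidate],
    safety_fn: Callable[[Candidate], float],
    preference_fn: Callable[[Candidate], float],
    safety_threshold: float = 0.5,
    top_k: int = 1,
) -> List[Candidate]:
    """Reject candidates below the safety threshold, then rank the rest by preference."""
    safe = [c for c in candidates if safety_fn(c) >= safety_threshold]
    return sorted(safe, key=preference_fn, reverse=True)[:top_k]


# Hypothetical scored samples from an unchanged generator.
samples = [
    {"id": 0, "safety": 0.9, "preference": 0.40},
    {"id": 1, "safety": 0.2, "preference": 0.95},  # preferred but unsafe
    {"id": 2, "safety": 0.8, "preference": 0.70},
]
chosen = select_outputs(
    samples,
    safety_fn=lambda c: c["safety"],
    preference_fn=lambda c: c["preference"],
)
```

The generator is never touched here: the unsafe sample is still produced and merely withheld, which is the sense in which post-hoc selection treats symptoms rather than causes.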

The same consistency relation can be enforced at several pipeline points. External consistency may use condition-path design, denoising-time guidance, or post-hoc verification [32,41]. Internal consistency often relies on parameter adaptation, cross-instance coupling, or memory-based sequence modeling [34,47]. Normative consistency may use preference finetuning, safety steering, concept editing, or evaluator-based filtering [37,38]. The choices create trade-offs in persistence, flexibility, cost, and controllability.

Across the relation-specific sections, we compare methods along the four auxiliary axes above: observation unit, agreement target, optimization locus, and evidence source. We also discuss each method's trade-off profile, which captures what may be sacrificed, such as diversity, editability, efficiency, benign capability, or prompt faithfulness [44,57]. The method tables use compact matrix views that emphasize the most relevant subset of these axes, while benchmark evidence and diagnostic coverage are summarized separately in Table 5. Figure 4 summarizes this taxonomy and links the three consistency relations to the subsections that follow.

3. External Consistency

External consistency evaluates whether generated outputs satisfy conditions specified outside the sample itself, including text prompts, layouts, sketches, poses, masks, reference images, and editing instructions. Unlike perceptual realism, it evaluates whether the requested content and controls are realized in the output. We discuss three forms of output-to-condition agreement: prompt and compositional faithfulness, grounded or structural control, and instruction-based editing. The settings enforce different kinds of conditions, but the underlying requirement is the same: the generated output should reflect the user- or task-provided constraint without unnecessary drift.

3.1. Prompt and Compositional Consistency

Prompt and compositional consistency concerns whether a generated image realizes the objects, attributes, counts, and relations specified by the prompt. Coarse prompt alignment is now common in strong text-to-image backbones, but fine-grained faithfulness remains brittle. A generated image may match the general topic while omitting a required object, binding an attribute to the wrong subject, confusing a spatial relation, or failing on multi-object counting. The problem is not generic image–text similarity. It is whether the prompt is grounded into the correct visual entities and relations.

The failures are better treated as assignment errors than as generic text–image mismatch. Common cases include subject omission, attribute misbinding, relation confusion, and counting failure. Benchmarks such as TIFA [50], GenEval [41], T2I-CompBench [42], and GenEval2 [109] make these errors visible by evaluating objects, attributes, relations, and counts separately instead of collapsing them into a single holistic similarity score. Representative methods intervene at different points in the generation pipeline. Stronger backbones, such as GLIDE [110], Imagen [111], eDiff-I [112], and SDXL [5], improve language grounding and visual coverage, but they do not specifically target prompt-faithfulness failures. More targeted methods modify the sampling process. Attend-and-Excite [31] boosts neglected subject tokens through cross-attention feedback, while SEGA [113] steers the denoising trajectory along semantic directions. Spatial guidance methods such as BoxDiff [58] translate abstract prompt requirements into localized constraints during sampling.

Prompt and compositional consistency should not be treated as a monolithic capability. Diffusion systems can perform well on open-ended prompts yet fail on binding-heavy, count-sensitive, or relation-sensitive prompts. Methods such as Attend-and-Excite [31], BoxDiff [58], and GLIGEN [48] follow a common pattern: they augment pretrained backbones with explicit repair, grounding, or verification mechanisms so that structured failures become easier to diagnose and correct. As shown in Figure 5, Attend-and-Excite [31] treats prompt faithfulness as a sampling-time repair problem. It identifies neglected subject tokens through cross-attention maps, converts them into a lightweight coverage objective, and updates the latent state so that missing concepts are more likely to appear. It illustrates a broader class of methods that correct semantic under-realization during denoising rather than relying only on backbone scaling.
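The coverage objective described for Attend-and-Excite can be sketched as follows. This is a simplified illustration, not the published implementation: it assumes the cross-attention maps are already available as small 2D grids with values in [0, 1], it omits any smoothing of the maps, and it stops at computing the loss and the most neglected token, leaving out the gradient update of the latent. The function and variable names are hypothetical.

```python
from typing import Dict, List, Tuple

AttentionMap = List[List[float]]


def coverage_loss(
    attention_maps: Dict[str, AttentionMap],
    subject_tokens: List[str],
) -> Tuple[float, str]:
    """Return the loss of the most neglected subject token and that token.

    A token counts as covered when at least one spatial location attends to it
    strongly, so its loss is one minus its peak attention value.
    """
    per_token = {
        token: 1.0 - max(max(row) for row in attention_maps[token])
        for token in subject_tokens
    }
    neglected = max(per_token, key=per_token.get)
    return per_token[neglected], neglected


# Toy 2x2 attention maps for a prompt such as "a cat and a frog".
maps = {
    "cat": [[0.1, 0.8], [0.0, 0.1]],
    "frog": [[0.05, 0.1], [0.2, 0.05]],  # weakly attended everywhere
}
loss, neglected_token = coverage_loss(maps, ["cat", "frog"])
```

Taking the maximum over tokens focuses each correction step on the subject that is currently least represented, which matches the description of boosting neglected subject tokens.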

We group prompt faithfulness and compositional faithfulness together because both test whether symbolic prompt constraints are grounded into the correct visual entities and relations. Compositional prompts make this requirement harder by adding binding, counting, and relation constraints, but they do not change the underlying consistency relation. Count-sensitive prompting also illustrates this point: methods such as Make It Count [60], CountCluster [114], YOLO-Count [62], and CountDiffusion [61] make numerosity and instance coverage explicit during generation or guidance.

Prompt-consistency methods differ in whether they repair weak grounding during sampling, introduce explicit spatial or counting constraints, or rely on stronger pretrained language–image representations. Attention-repair methods are lightweight and training-free, but they mainly address missing or weakly activated concepts and may not resolve relation-level conflicts. Spatial and counting guidance methods expose object placement and numerosity more directly, but they often require additional detectors, boxes, or constraint formulations. Backbone scaling improves broad language grounding, yet it does not guarantee systematic compositional correctness. The differences suggest that prompt consistency is limited less by image realism than by the absence of explicit binding mechanisms between linguistic units and visual evidence.

3.2. Grounded and Structural Consistency

Grounded and structural consistency extends prompt faithfulness to control signals beyond text, including layouts, poses, depth maps, masks, reference images, and combinations of these conditions. The requirement is that text, layout, reference, and auxiliary controls be satisfied jointly rather than in isolation. The setting has three related requirements. Grounded consistency asks whether requested content appears in the right place. Structural consistency asks whether geometry, pose, edge, or layout constraints are treated as commitments rather than soft stylistic hints. Multi-condition consistency asks whether heterogeneous controls can coexist without incompatible guidance during denoising, which is especially relevant when text is combined with depth, masks, references, or spatial priors.

We group these methods by where the conditions enter the model. One route introduces side pathways or lightweight control modules. ControlNet [32] attaches trainable control branches to a largely frozen denoiser, enabling edge, pose, depth, and related structural conditioning without rewriting the full generator. T2I-Adapter [63] follows a lightweight adapter design, while GLIGEN [48] supports grounded generation through grounding tokens and box-aware interaction. A second route emphasizes reference or appearance control. IP-Adapter [64] and related pipelines convert image features into reusable conditioning streams, extending external consistency from text-layout agreement to text-reference agreement. A third route studies condition composition, asking how multiple controls should be weighted, fused, or prioritized when they conflict.

As shown in Figure 6, ControlNet [32] illustrates condition-path optimization: a frozen diffusion backbone is augmented with a trainable control branch that injects structural signals during denoising. The difficulty often lies less in the number of control modalities than in conflicts among them. A pose prior may disagree with a textual action; a reference image may preserve identity while pulling style in an unwanted direction; a layout may be correct but visually over-constraining. Stronger control is not always better, because enforcing one condition can weaken another or reduce diversity.
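The frozen-backbone-plus-control-branch pattern can be illustrated with a toy residual connection. The sketch below is an assumption-laden simplification: `frozen_block` is a fixed arithmetic stand-in for a pretrained layer, and the control branch is a per-element gate rather than a network. The zero initialization mirrors the zero-convolution idea reported for ControlNet, under which the branch initially leaves the backbone output unchanged.

```python
from typing import List


def frozen_block(x: List[float]) -> List[float]:
    """Stand-in for a pretrained denoiser block whose weights are never updated."""
    return [2.0 * v + 1.0 for v in x]


class ControlBranch:
    """Toy trainable branch: a per-element gate applied to control features."""

    def __init__(self, dim: int) -> None:
        # Zero initialization: before training, the branch contributes nothing.
        self.gate: List[float] = [0.0] * dim

    def __call__(self, control: List[float]) -> List[float]:
        return [g * c for g, c in zip(self.gate, control)]


def controlled_block(x: List[float], control: List[float], branch: ControlBranch) -> List[float]:
    """Backbone output plus the residual injected by the control branch."""
    return [h + r for h, r in zip(frozen_block(x), branch(control))]


x = [0.5, -0.5]
edge_features = [1.0, 1.0]  # hypothetical encoded structural condition

branch = ControlBranch(dim=2)
before_training = controlled_block(x, edge_features, branch)

branch.gate = [0.5, -0.25]  # pretend training has moved the gate away from zero
after_training = controlled_block(x, edge_features, branch)
```

Because only `branch.gate` changes, the backbone's behavior without a control signal is preserved, which is why this design can add structural conditioning without rewriting the generator.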

Mechanism-level differences are better understood by asking where condition arbitration occurs. Side-pathway methods bind structural signals to dedicated interfaces [32,63], whereas reference-feature and energy-guided methods leave more arbitration to denoising-time fusion or guidance [64,66,68]. The former usually improves traceability and persistence; the latter is more flexible but can make conflicts harder to diagnose.

Recent work expands the design space along three families: stronger structural control over sketches, poses, layouts, or energy guidance [66,67,68,115]; layout-centric generation for design structure [116,117,118,119]; and multi-condition or reference grounding [59,65,120]. These families address different failures: weak structural adherence, poor design-level arrangement, and conflicts among heterogeneous conditions.

The main trade-off is control strength versus generative flexibility. Control branches and adapters provide reusable structural interfaces, but depend on condition quality and compatibility. Grounding-token, box-aware, and reference-adapter methods improve anchoring or appearance agreement, yet may leave fine-grained interactions under-constrained or transfer unintended identity, style, or background details. Progress therefore requires not only stronger conditioning, but explicit mechanisms for detecting, prioritizing, and resolving conflicts among text, layout, pose, depth, mask, and reference image.
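One way to picture "detecting, prioritizing, and resolving" conflicts is to treat each condition as a guidance direction and compare directions before combining them. The sketch below is a hypothetical illustration, not a method from the surveyed papers: it flags a conflict when two directions have negative cosine similarity and, borrowing the projection idea used in gradient-surgery methods for multi-task learning, removes the part of the lower-priority direction that opposes the higher-priority one. The function names and the two-dimensional vectors are assumptions.

```python
import math
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is zero."""
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    return dot(a, b) / (na * nb) if na > 0 and nb > 0 else 0.0


def resolve(high: Vector, low: Vector) -> Vector:
    """Combine two guidance directions, letting `high` win under conflict."""
    overlap = dot(low, high)
    if overlap < 0.0:
        # Conflict: drop the component of `low` that points against `high`.
        scale = overlap / dot(high, high)
        low = [l - scale * h for l, h in zip(low, high)]
    return [h + l for h, l in zip(high, low)]


pose_dir = [1.0, 0.0]    # hypothetical guidance direction from a pose prior
text_dir = [-0.5, 1.0]   # hypothetical direction from a textual action

conflict = cosine(pose_dir, text_dir) < 0.0
combined = resolve(pose_dir, text_dir)
```

Here the pose prior is given priority, so the combined direction keeps the full pose component while retaining the part of the text direction that does not oppose it. Logging `conflict` alongside the result is one simple way to make the arbitration traceable.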

3.3. Edit Consistency

Edit consistency is an external-consistency problem with an additional preservation constraint. The model must execute an instruction while keeping unchanged regions stable. Editing is thus a two-sided requirement: the requested change should be localized, and content outside the edit target should remain invariant. It is also a boundary case, because preservation within one edited sample can extend to identity or temporal consistency when edits are applied across multiple outputs or frames.

A compact formulation separates three components: the edit target, the preservation set, and the boundary between them. The edit target specifies what should change, such as an attribute, style, background element, or action. The preservation set specifies what should remain stable, such as identity, geometry, composition, illumination, or layout. The boundary determines whether the edit remains local or leaks into unrelated regions. Many editing failures are boundary failures: the requested change is made, but adjacent or visually coupled content drifts as collateral damage.
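The three components can be turned into a minimal two-number report. The sketch assumes images flattened to lists of intensities and a binary `target_mask` marking the edit target; everything outside the mask is treated as the preservation set. The function name and the mean-absolute-difference measure are illustrative assumptions, not a protocol from the cited work.

```python
from typing import Dict, List


def mean(values: List[float]) -> float:
    return sum(values) / len(values) if values else 0.0


def edit_report(source: List[float], edited: List[float],
                target_mask: List[int]) -> Dict[str, float]:
    """Mean absolute change inside the edit target and in the preservation set."""
    inside = [abs(e - s) for s, e, m in zip(source, edited, target_mask) if m]
    outside = [abs(e - s) for s, e, m in zip(source, edited, target_mask) if not m]
    return {"target_change": mean(inside), "collateral_change": mean(outside)}


source = [0.2, 0.2, 0.8, 0.8]
target_mask = [1, 1, 0, 0]                 # first two pixels are the edit target

clean_edit = [0.9, 0.9, 0.8, 0.8]          # change stays inside the target
leaky_edit = [0.9, 0.9, 0.5, 0.8]          # change leaks across the boundary

clean = edit_report(source, clean_edit, target_mask)
leaky = edit_report(source, leaky_edit, target_mask)
```

Both edits produce the same `target_change`, so a success-only score cannot separate them; the boundary failure only shows up in `collateral_change`.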

Methods differ mainly in how they localize the edit and preserve the rest. DiffEdit [69] contrasts source and target prompts to infer an edit mask and perform region-aware latent editing; as shown in Figure 7, this exposes the core editing logic compactly: prompt differences define an edit mask, and mask-guided diffusion applies the change while preserving surrounding content. InstructPix2Pix [33] and InstructDiffusion [70] make natural-language instructions the primary editing interface, so that instruction following directly drives the edit. Prompt-to-Prompt [43] controls cross-attention maps to preserve the original semantic layout while replacing edited content. Video-editing systems such as Video-P2P [86] and FateZero [85] extend the same preservation problem to temporally coupled outputs.
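The mask-then-blend logic attributed to DiffEdit above can be sketched on toy data. The sketch assumes noise estimates are already available as flat lists for a few noise draws (no diffusion model is called), and the helper names and the 0.5 threshold on the normalized difference are assumptions for illustration rather than a faithful reimplementation.

```python
from typing import List


def estimate_mask(eps_source_runs: List[List[float]],
                  eps_target_runs: List[List[float]],
                  threshold: float = 0.5) -> List[int]:
    """Average |target - source| noise estimates over runs, normalize, threshold."""
    n = len(eps_source_runs[0])
    diff = [0.0] * n
    for eps_s, eps_t in zip(eps_source_runs, eps_target_runs):
        for i in range(n):
            diff[i] += abs(eps_t[i] - eps_s[i])
    diff = [d / len(eps_source_runs) for d in diff]
    peak = max(diff)
    if peak == 0.0:
        return [0] * n  # prompts agree everywhere: nothing to edit
    return [1 if d / peak >= threshold else 0 for d in diff]


def masked_blend(edited: List[float], source: List[float],
                 mask: List[int]) -> List[float]:
    """Keep edited values inside the mask and source values outside it."""
    return [e if m else s for e, s, m in zip(edited, source, mask)]


# Toy noise estimates under the source and target prompts (two noise draws each).
eps_source_runs = [[0.1, 0.1, 0.1, 0.1], [0.0, 0.2, 0.1, 0.1]]
eps_target_runs = [[0.9, 0.8, 0.1, 0.2], [0.8, 0.9, 0.2, 0.1]]

mask = estimate_mask(eps_source_runs, eps_target_runs)
blended = masked_blend([5.0, 6.0, 7.0, 8.0], [1.0, 2.0, 3.0, 4.0], mask)
```

In a real sampler the blend would be applied to latents at each denoising step; the single blend here only shows how the mask routes edited and preserved content.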

Edit consistency cannot be judged by edit success alone. A system that executes the requested change but rewrites geometry, identity, or unrelated objects is inconsistent; one that preserves nearly everything but under-executes the instruction is also unsatisfactory. The dual requirement places editing at the intersection of instruction following, structure preservation, and localized control. We classify it as external consistency when the dominant requirement is to realize an externally supplied instruction within a single edited sample. When preservation must hold across multiple outputs, views, or frames, the dominant difficulty shifts toward internal consistency. Table 2 summarizes representative external-consistency method families.

Editing methods differ in how they access and preserve source content. Inversion- or exemplar-based methods, including Null-Text Inversion [71], Plug-and-Play Diffusion Features [121], Paint-by-Example [73], Imagic [72], and SINE [122], use latent reconstruction or exemplar guidance to retain source content. Instruction-aware and feature-centric editors, including PAIR-Diffusion [123], SmartEdit [124], TurboEdit [125], and FreeFlux [126], translate language or feature controls into localized edits. Dragging and video-editing methods, such as DragDiffusion [127], InstaDrag [128], Video-P2P [86], and StableV2V [129], extend preservation to point trajectories or coupled frames.

External-consistency methods differ in training cost, control interface, and deployment setting, but many strengthen the path from explicit condition to denoising. Control-side methods inject structure or reference features into the backbone [32,48,63,64]; attention-side methods steer which tokens or regions dominate generation [31,43]; and sampling-time methods modify the trajectory itself [58,113]. We use condition traceability to describe whether the final sample makes clear how the requested condition concretely influenced placement, reference preservation, or semantic emphasis.

Evaluation in this regime is mature but non-uniform. TIFA [50], GenEval [41], T2I-CompBench [42], and GenEval 2 [109] expose omission, misbinding, counting, and relation failures that image–text similarity often misses. Grounded and editing settings also require layout adherence, pose or depth faithfulness, and preservation of unedited regions [33,69,70]. Adjacent resources such as ConceptBed [130] and Conceptual Captions [131] provide useful background, but are not primarily external-consistency evaluations.

External consistency matters for controllable creation, structure-aware design, interactive editing, and human-in-the-loop generation. Poster, layout, try-on, and dressing systems [132,133,134,135,136,137] rely on preserving supplied structure while changing requested content. Across applications, external consistency is best reported as condition traceability: which constraints affected the sample, where they were realized, and which ones were relaxed under conflict. Evaluation should report both target satisfaction and collateral change. The unresolved gap is conflict-aware condition arbitration, since current methods often strengthen individual controls without reliably resolving incompatible constraints.

4. Internal Consistency

Internal consistency concerns whether generated content remains compatible across parts, samples, views, or time. Unlike external consistency, which compares an output with an external condition, internal consistency tests whether generated states can be interpreted as the same subject, scene, or evolving process. A sample may look plausible in isolation but fail once related outputs are inspected together. We focus on cross-image, cross-view, and cross-time consistency, where multi-instance coordination is most visible. Intra-image self-compatibility, such as support, occlusion, and illumination consistency, is treated as a boundary case overlapping with structural control and physical plausibility. Typical failures include identity drift, cross-view disagreement, video flicker, and loss of established entities or events in long narratives. These failures reflect unstable shared state rather than prompt mismatch.

4.1. Subject and Identity Consistency

Subject and identity consistency concerns whether a generator can preserve the recognizable identity of an object, person, or character while changing scene, pose, style, or action. It is one of the most visible forms of internal consistency because users often expect the “same subject” to persist across multiple prompts or related images.

One line of work uses subject-specific adaptation. Textual Inversion [138] and DreamBooth [34] bind a user-provided subject to a learned token or finetuned model parameters. Custom Diffusion [74] reduces the update footprint by adapting a small subset of cross-attention parameters, while Perfusion [139] uses key-locking and gated rank-1 updates to reduce over-specialization. These methods can bind identity strongly, but they usually require per-subject adaptation and may produce representations that are harder to reuse across tasks. A second line uses reference-aware identity conditioning. PhotoMaker [52], InstantID [76], and related systems preserve identity through reusable image features rather than heavy per-subject optimization. That makes identity preservation easier to combine with new prompts, styles, or editing objectives, although the strength of identity binding still depends on the reference encoder and on how identity features are fused into the generator. Figure 8 contrasts these two paradigms: DreamBooth [34] writes identity into model parameters through subject-specific finetuning and prior preservation, whereas PhotoMaker [52] keeps the backbone reusable and encodes multiple ID images into a stacked identity embedding at inference time. The trade-off is between stronger subject binding through adaptation and faster reuse through conditioning.

A third direction extends identity preservation from single-subject replication to multi-image continuity. Story-oriented systems such as StoryDiffusion [47], ConsiStory [77], and related consistent-subject generators show that stable identity across scenes requires more than matching facial features; clothing, accessories, silhouette, and role-defining attributes must also remain coherent. It is useful to separate appearance copying from identity persistence. Appearance copying reproduces visible reference details, whereas identity persistence preserves what makes the subject recognizable under changes in pose, background, or style. The latter better captures the practical requirement behind personalization, story visualization, and long-horizon character generation. It also exposes a recurring trade-off: stronger coupling improves persistence but can reduce editability, controllability, or diversity when the shared representation becomes too rigid.

The broader subject-consistency literature shows the same split. Parameter-efficient and relation-aware methods study how subjects or concepts can be bound with smaller updates [75,140,141,142], while context-rich and multi-image personalization methods emphasize reference fusion and reusable conditioning [143,144,145,146]. The trade-off involves accuracy, cost, state storage, and how easily identity can be edited, recombined, or reused.

The main mechanism-level distinction is where the identity state is stored. Finetuning-based methods store identity in model parameters or learned tokens, which can produce strong subject binding but often require per-subject optimization and may reduce editability. Reference-conditioning methods store identity in external features, making them faster and easier to reuse, but their identity strength depends on encoder quality and feature injection. Story-level methods must additionally preserve role, clothing, accessories, and relational context, not only face similarity. Identity consistency is not a single metric: face similarity, clothing preservation, subject-role persistence, and controllable variation should be reported separately when possible.
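Reporting identity aspects separately can be sketched as follows. The example assumes that some encoder has already produced one embedding per aspect (face, clothing) for a reference and a generated image; the aspect names, the two-dimensional vectors, and the cosine measure are hypothetical stand-ins, not a specific benchmark protocol.

```python
import math
from typing import Dict, List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is zero."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def aspect_report(reference: Dict[str, Vector],
                  generated: Dict[str, Vector]) -> Dict[str, float]:
    """One similarity score per identity aspect instead of a single number."""
    return {aspect: cosine(reference[aspect], generated[aspect])
            for aspect in reference}


# Hypothetical per-aspect embeddings: the face matches, the clothing does not.
reference = {"face": [1.0, 0.0], "clothing": [0.0, 1.0]}
generated = {"face": [1.0, 0.0], "clothing": [1.0, 0.0]}

report = aspect_report(reference, generated)
pooled = sum(report.values()) / len(report)
```

The pooled score of 0.5 hides the fact that face similarity is perfect while clothing has drifted completely, which is why per-aspect reporting is more informative for story-level settings.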

4.2. Multi-View and 3D Consistency

Multi-view and 3D consistency concerns whether outputs depicting the same object or scene from different viewpoints remain compatible in geometry, texture, and implied structure. The requirement is stricter than 2D subject persistence because viewpoint changes allow legitimate appearance variation while still requiring a shared 3D explanation. A system may generate several visually plausible views that cannot coexist as the same object.

Two broad strategies dominate this area. The first treats the task as novel-view prediction from a reference. Zero-1-to-3 [78] conditions generation on a source image and a target camera transformation, enabling novel-view synthesis without full 3D reconstruction. Follow-up systems such as Zero123++ [147] and Cascade-Zero123 [79] improve view coverage and refinement. This line of work is flexible, but cross-view agreement remains limited when views are generated independently or only weakly coupled.

The second strategy moves toward joint multi-view generation. SyncDreamer [80], MVDream [35], and related systems couple views during denoising, making agreement part of generation itself rather than a post-hoc reconstruction concern. The design is more expensive, but it directly addresses latent drift across views. Evaluation is equally important: MET3R [51] provides a metric-oriented view of geometric compatibility, while MVG-Bench [84] provides a benchmark setting for stress-testing multiview consistency. A multi-view system should be rewarded for images that can coexist under one geometric interpretation, not for plausibility alone.

SyncDreamer [80] makes the shift from independent view synthesis to synchronized multiview reverse diffusion explicit. It couples intermediate states across views through 3D-aware feature attention, treating multiview consistency as part of denoising rather than a downstream reconstruction issue. Compared with earlier novel-view methods such as Zero-1-to-3, this synchronization uses a structured setup, including a fixed set of target viewpoints around a single input image rather than arbitrary viewpoint control. Multi-view generation is a natural test case for consistency: viewpoint changes expose contradictions that single-view evaluation often hides, and once outputs are reconstructed, rendered, or reused as assets, cross-view agreement becomes a structural requirement rather than an aesthetic preference. Adjacent work on continuous 3D words [148], viewpoint textual inversion [149], and camera-viewpoint customization [150] extends identity-style persistence toward geometry-aware persistence. Data resources such as Objaverse [151] and ShapeNet provide scaffolds for multiview and world-grounded generation, but they do not by themselves solve cross-view consistency.

Multi-view consistency differs from ordinary identity preservation because the shared state must support view-dependent variation under a single geometry. Methods that generate views independently can preserve coarse appearance while violating 3D structure. Joint multi-view denoising, 3D-aware attention, and reconstruction-guided refinement make the shared state more explicit, but they increase memory cost and often restrict viewpoint flexibility. Evaluation should therefore distinguish view-level image quality from geometric compatibility: attractive views are insufficient if they cannot be reliably reconstructed, registered, or rendered as one object or scene.
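The difference between view-level quality and geometric compatibility can be illustrated with a deliberately simple check. The sketch assumes a rectified two-camera setup, with the second camera displaced horizontally to the right of the first, and assumes matched keypoints between the two generated views are already available. Under that assumption, a shared 3D point must land on the same image row in both views and show positive disparity. The function name and tolerance are hypothetical, and this is not the procedure of MET3R or MVG-Bench.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def epipolar_violation_rate(matches: List[Tuple[Point, Point]],
                            row_tol: float = 1.0) -> float:
    """Fraction of matches that cannot come from one 3D point (rectified pair)."""
    bad = 0
    for (x_left, y_left), (x_right, y_right) in matches:
        same_row = abs(y_left - y_right) <= row_tol
        positive_disparity = (x_left - x_right) > 0.0
        if not (same_row and positive_disparity):
            bad += 1
    return bad / len(matches)


# Matches that admit a single geometric explanation.
consistent = [((120.0, 40.0), (110.0, 40.0)),
              ((200.0, 85.0), (170.0, 85.5))]

# Matches from two individually plausible views that cannot coexist:
# one changes rows, the other has negative disparity.
inconsistent = [((120.0, 40.0), (110.0, 52.0)),
                ((200.0, 85.0), (215.0, 85.0))]

ok_rate = epipolar_violation_rate(consistent)
bad_rate = epipolar_violation_rate(inconsistent)
```

Neither rate says anything about how attractive the individual views are, which is the point: the check scores only whether the pair can be explained by one geometry.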

4.3. Temporal and Narrative Consistency

Temporal and narrative consistency extends internal consistency to ordered sequences. Temporal consistency concerns short-horizon coherence among adjacent frames or states, including local identity, geometry, and motion continuity. Narrative consistency concerns longer-horizon stability of characters, objects, scene states, and event logic across frames, panels, or story beats. It can be viewed as temporal consistency with longer memory and a richer state space.

Narrative settings often include externally supplied story prompts, scripts, or episode descriptions. The internal-consistency question is whether the realized sequence remains compatible with itself after those conditions have been instantiated. The target is persistent compatibility among generated characters, objects, and states, not only prompt-to-output matching. The boundary is conceptual rather than operational, since prompting and state tracking are usually entangled in practical systems. The primary difficulty is error accumulation. A small deviation may propagate and amplify: clothing drifts, objects disappear, backgrounds change unexpectedly, or actions begin from states that were never established. Because diffusion generators often optimize local plausibility more easily than long-range continuity, sequence-level failures remain common even in visually strong systems.

Mechanistically, three patterns recur. The first adds motion or temporal modules to image-oriented diffusion backbones, as in AnimateDiff [36]. The second manipulates cross-frame attention and feature sharing; Video-P2P [86], FateZero [85], and StableV2V [129] preserve edited content by coupling frames during attention or shape propagation. The third introduces memory or story-state propagation; StoryDiffusion [47], TaleCrafter [87], StoryGPT-V [152], and MovieDreamer [89] combine planning, persistent identity cues, and hierarchical control. Figure 9 illustrates the spectrum between short-range temporal coupling and longer-horizon identity, attire, and scene-state preservation.

Both temporal and narrative consistency can be viewed as state-tracking problems. The generator must preserve enough latent or explicit state to remain compatible with what has already been generated while still allowing new content to enter. Table 3 summarizes representative method families by the state they stabilize and the drift they address.

Temporal-consistency methods have progressed from implicit to explicit coupling. Early systems adapted image-diffusion priors to coherent motion [45,153,154,155,156]; later work makes coupling more explicit through motion coordination, trajectory control, identity preservation, decomposed motion conditioning, or temporally consistent autoencoding [157,158,159]. Story-oriented work extends the unit of consistency from adjacent frames to persistent entities, panel-level planning, and retrieval-augmented state [160].

Across these settings, methods differ mainly in what state they share and how they share it. Subject-centric systems store identity through adaptation or reference-aware conditioning [34,52,74,76,138,139]. Multi-view and temporal systems extend this idea through feature coupling and joint generation, using synchronized views, motion cues, shape-preserving video coupling, or narrative-state propagation [35,36,47,77,80,129]. The core distinction is the stabilized state: subject identity, scene geometry, motion state, or long-horizon story memory.

Evaluation must move from single-sample visual quality to set-level compatibility. Identity-preservation protocols only partially capture this problem, because an image can look like the same character while still drifting in accessories, pose-dependent cues, or role attributes. MVG-Bench [84] and MET3R [51] reveal cross-view incompatibility that per-image inspection misses. In longer sequences, evaluation must track persistent entities, recurring attributes, and state transitions across many steps rather than only adjacent-frame smoothness. Video segmentation, tracking, and video-language resources such as MeViS [161], TAO-style tracking resources [162], TubeLink [163], and MOSE [164] can support such diagnostics, although they are not generative benchmarks by themselves.
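The gap between adjacent-frame smoothness and long-range continuity is easy to reproduce on synthetic data. The sketch assumes each frame is summarized by a hypothetical subject embedding that rotates by five degrees per frame; the embedding, the rotation, and the two summary scores are illustrative assumptions.

```python
import math
from typing import Dict, List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is zero."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def drift_report(frames: List[Vector]) -> Dict[str, float]:
    """Worst adjacent-frame similarity versus similarity to the first frame."""
    adjacent = min(cosine(frames[k], frames[k + 1])
                   for k in range(len(frames) - 1))
    anchor = cosine(frames[0], frames[-1])
    return {"min_adjacent": adjacent, "first_to_last": anchor}


# Hypothetical embeddings: a slow 5-degree rotation per frame over 19 frames.
frames = [[math.cos(math.radians(5 * k)), math.sin(math.radians(5 * k))]
          for k in range(19)]

report = drift_report(frames)
```

Every adjacent pair is nearly identical, yet the last frame is orthogonal to the first. A metric that only inspects neighbours would rate this sequence as stable, while the anchor comparison exposes the accumulated drift.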

Applications show why internal consistency is more than an aesthetic preference. Personalization requires stable subject identity; story visualization requires coherent characters and object states; long-video generation requires memory of prior events; and multi-view or 3D pipelines require cross-view agreement strong enough for reconstruction, asset creation, and downstream graphics workflows [35,36,47,78,80,87,89].

The main question is how much shared state to impose. Strong coupling improves persistence but can reduce diversity, editability, and motion richness; weak coupling keeps generation flexible but leaves room for identity drift, geometric contradiction, or narrative forgetting. Evaluation should track both stability and allowed variation: subject recognizability apart from pose, style, and context diversity; image fidelity apart from reconstructability; and adjacent-frame smoothness apart from long-range state continuity. The unresolved gap is persistent yet editable shared state: current methods stabilize local identities, views, or frames, but still struggle to preserve state across long horizons without freezing variation.

5. Normative Consistency

Normative consistency concerns whether generated outputs satisfy standing evaluative criteria that are not fully specified by the prompt or by another generated sample. We use “normative” operationally: the reference is a standard that should remain active even when the user does not restate it, not necessarily a moral or human-value judgment. The boundary is therefore defined by the source of the criterion, not by its surface content. A requested style, object, or physical event is external when it is part of the prompt; broad preference, safety, policy compliance, physical plausibility, commonsense, or causal validity is normative when it functions as a default requirement for acceptable generation. The category matters because an output can be visually strong, prompt-faithful, and temporally smooth while still being unsafe, undesirable, physically implausible, causally incoherent, or unsuitable for deployment.

We further distinguish two forms of normative consistency. Human-centered normative consistency concerns agreement with human preference, aesthetic judgment, safety requirements, platform policies, or value constraints. Its evidence sources usually include human comparisons, learned reward models, safety classifiers, red-teaming protocols, or moderation criteria. World-centered normative consistency concerns agreement with physical plausibility, commonsense knowledge, causal structure, and action-conditioned state evolution. Its evidence sources usually include physics-aware benchmarks, world-knowledge tests, simulation-inspired diagnostics, causal checks, or human stress tests. The distinction matters because preference alignment, safety suppression, and physical causality rely on different signals and fail in different ways.

Normative consistency is also preservation-sensitive: improving preference, safety, or plausibility should not unnecessarily damage prompt fidelity, diversity, benign capability, identity preservation, or downstream usability.

5.1. Preference and Aesthetic Consistency

Preference and aesthetic consistency are human-centered normative requirements. The agreement target is a distribution of human judgments about visual quality, composition, style, clarity, or usefulness, rather than a literal prompt condition or a relation among generated samples. These signals are difficult to specify symbolically, but can be estimated from ratings, pairwise comparisons, or implicit user choices. The distinction from external consistency is important: a user-requested “oil painting” or “minimalist poster” is an external condition, whereas optimization toward broadly preferred outputs reflects a general evaluative criterion.

Work on preference consistency has two coupled components: evaluators and optimizers. On the evaluator side, Pick-a-Pic [93] provides human-comparison data, while ImageReward [53] turns such supervision into a learned scorer. HPS [90], HPSv2 [165], HPSv3 [166], MPS [91], and VisionReward [92] refine the evaluator layer by improving calibration, robustness, or sensitivity. Preference optimization is only as reliable as the reward model that defines the target.

On the optimization side, methods differ in how directly they update the generator. As shown in Figure 10, Diffusion-DPO [37] converts pairwise preference into direct model updates, extending preference optimization ideas from language models to diffusion generation. Outputs sampled from the same prompt are compared in pairs, human preference labels determine the favored sample, and the resulting signal optimizes the diffusion model rather than only reranking completed generations. This training signal separates preference alignment from ordinary prompt following. Related work uses reward-guided distillation, reranking, or training-time alignment to favor outputs scored highly by learned evaluators. Preference consistency does not reduce to prompt faithfulness: an output can satisfy the prompt yet score poorly under human preference, while a preferred output may deviate from a strict reading of the prompt.
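The pairwise update can be sketched as a scalar loss. The sketch assumes that squared denoising errors have already been computed for the preferred ("winner") and dispreferred ("loser") samples under both the trained model and a frozen reference model, and it folds the timestep-dependent weighting of the published objective into a single hypothetical `beta`. It is a simplified illustration of a DPO-style loss, not the authors' implementation.

```python
import math


def dpo_style_loss(err_w_model: float, err_w_ref: float,
                   err_l_model: float, err_l_ref: float,
                   beta: float = 1.0) -> float:
    """-log sigmoid(-beta * margin), written as log(1 + exp(beta * margin)).

    margin < 0 means the model improved on the winner, relative to the
    reference, by more than it improved on the loser.
    """
    margin = (err_w_model - err_w_ref) - (err_l_model - err_l_ref)
    return math.log1p(math.exp(beta * margin))


# Model identical to the reference: no preference has been learned yet.
baseline = dpo_style_loss(1.0, 1.0, 1.0, 1.0)

# Model denoises the preferred sample better and the dispreferred one worse.
improved = dpo_style_loss(0.6, 1.0, 1.3, 1.0)

# Model moved the wrong way: better on the loser, worse on the winner.
regressed = dpo_style_loss(1.3, 1.0, 0.6, 1.0)
```

The loss depends only on errors relative to the reference, so it rewards shifting probability toward preferred samples rather than lowering denoising error everywhere.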

Preference consistency introduces a softer target than exact semantic matching: agreement with a distribution of human judgments. The target is flexible, but harder to stabilize and generalize across users, tasks, and populations. Recent work moves beyond static reward scoring through reward-guided latent consistency distillation [167], reinforcement for few-step diffusion [168], and flow-map alignment [169], with the goal of preserving preference structure while reducing cost.

5.2. Safety and Value Consistency

Safety and value consistency are human-centered normative requirements focused on selective prevention rather than preference maximization. Here, value consistency operationally denotes deployment boundaries specified by platform rules, legal constraints, social norms, or chosen value criteria, rather than a complete treatment of value learning. Generation should remain within these boundaries while reducing harmful or prohibited outputs without damaging benign requests or general image quality.

Safety methods can be grouped by intervention locus. Inference-time guidance methods such as Safe Latent Diffusion (SLD) [38] steer sampling away from unsafe latent regions without retraining the backbone. This treats safety as trajectory steering rather than post-hoc rejection, and differs from output filtering or concept erasure. A second route edits the model more persistently. Unified Concept Editing [97], concept-erasing methods such as Erasing Concepts from Diffusion Models [99], and robust variants such as STEREO [172] frame safety as parameter-level removal or suppression of unwanted concepts. ACE and related anti-editing variants [98] ask whether harmful concepts can be removed while minimizing collateral semantic damage. A third route adds filtering, moderation, or human control, shifting part of the safety burden outside the generator.
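Trajectory steering of this kind can be sketched as a modification of classifier-free guidance. The sketch assumes three noise estimates are available at a sampling step (unconditional, prompt-conditioned, and conditioned on an unsafe-concept description) and uses a single scalar `mu` for the safety term. The published SLD rule uses an element-wise, thresholded scale with additional scheduling, so the constant `mu` and the function name are simplifying assumptions.

```python
from typing import List


def safety_guided_eps(eps_uncond: List[float], eps_prompt: List[float],
                      eps_unsafe: List[float], guidance_scale: float = 7.5,
                      mu: float = 1.0) -> List[float]:
    """Classifier-free guidance with the unsafe direction subtracted.

    With mu = 0 this reduces to ordinary classifier-free guidance.
    """
    out = []
    for u, p, s in zip(eps_uncond, eps_prompt, eps_unsafe):
        prompt_dir = p - u   # where the prompt pulls the sample
        unsafe_dir = s - u   # where the unsafe concept would pull it
        out.append(u + guidance_scale * (prompt_dir - mu * unsafe_dir))
    return out


eps_uncond = [0.0, 0.0]
eps_prompt = [1.0, 1.0]
eps_unsafe = [1.0, 0.0]   # the unsafe concept overlaps the prompt on axis 0 only

plain = safety_guided_eps(eps_uncond, eps_prompt, eps_unsafe,
                          guidance_scale=2.0, mu=0.0)
steered = safety_guided_eps(eps_uncond, eps_prompt, eps_unsafe,
                            guidance_scale=2.0, mu=1.0)
```

The steered estimate cancels guidance along the axis shared with the unsafe concept and leaves the other axis untouched, which is the selective behaviour that distinguishes steering from rejecting the whole output.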

Safety becomes a consistency problem when suppression must be selective. Benchmarks such as Six-CD [101] expose this dual requirement: a method should suppress unwanted generations while preserving behavior on benign prompts semantically adjacent to the erased concepts. Safe generation requires both restriction and retention. The same preservation-sensitive view extends beyond safety narrowly construed. In value-aware generation, the challenge is to shape behavior according to a normative target without introducing brittleness, over-refusal, or semantic forgetting.
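The dual requirement can be reported as two rates rather than one. The sketch assumes per-prompt boolean judgments are already available from some classifier or human review; the function name, field names, and toy data are hypothetical and do not reproduce the Six-CD protocol.

```python
from typing import Dict, List


def selective_safety_report(unsafe_on_target_prompts: List[bool],
                            correct_on_benign_prompts: List[bool]) -> Dict[str, float]:
    """Suppression on prompts that target the erased concept,
    retention on benign prompts that sit close to it."""
    unsafe_rate = sum(unsafe_on_target_prompts) / len(unsafe_on_target_prompts)
    retention = sum(correct_on_benign_prompts) / len(correct_on_benign_prompts)
    return {"suppression": 1.0 - unsafe_rate, "retention": retention}


# Hypothetical over-erasing model: never unsafe, but half of the
# semantically adjacent benign prompts are now rendered incorrectly.
over_erased = selective_safety_report(
    unsafe_on_target_prompts=[False, False, False, False],
    correct_on_benign_prompts=[True, False, True, False],
)
```

A suppression-only score would rate this model as perfect; the retention rate of 0.5 records the collateral loss on benign prompts.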

Concept erasure and safety alignment can be grouped into three routes. Direct editing methods study concept ablation, large-scale concept editing [100], and analyses of whether concepts are actually erased [173]. Localized or continual methods target narrower edits, as in memories-of-forgotten-concepts style continual erasure [174], localized erasure [175], and pinpoint erasers [176]. Feedback- and preservation-aware variants study how to reduce harmful generations while retaining benign capability [177].

5.3. Physical, Commonsense, and Causal Consistency

Physical, commonsense, and causal consistency are world-centered normative requirements. The agreement target is not a human preference distribution or a safety policy, but a set of expectations about how objects, scenes, and events behave in the world. These requirements are not reducible to prompt faithfulness or internal continuity. A video may follow its prompt and remain temporally smooth while still showing impossible dynamics, implausible contacts, broken causality, or incoherent scene logic. We distinguish three overlapping levels. Physical consistency concerns plausible motions, contacts, deformations, and state transitions. Commonsense consistency concerns whether scenes and object interactions match ordinary world knowledge. Causal consistency concerns whether event order and state change remain intelligible under simple cause–effect reasoning. The hierarchy explains why world-grounded evaluation is harder than aesthetic scoring or prompt checking.

Recent benchmarks diagnose these failures beyond appearance and prompt relevance. Image-level resources such as PhyBench [102] probe static physical commonsense, while video benchmarks such as VideoPhy [39], VideoPhy-2 [103], PhyGenBench [54], T2VPhysBench [170], T2VWorldBench [104], PhyWorldBench [105], and VideoVerse [171] evaluate motion, contact, causality, or world-state plausibility. These resources expose errors that ordinary perceptual metrics often miss, including violations of contact logic, intuitive dynamics, conservation, and event causality. For video generation, simulation, planning, and embodied prediction, world-structural consistency affects whether outputs can be used beyond visual inspection. A model must maintain event structure that humans and downstream systems can reason about, rather than merely rendering convincing objects. Table 4 summarizes these normative targets and their operational routes.
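A toy diagnostic shows how a dynamics violation can be invisible to smoothness checks. The sketch assumes an object's vertical image position has been tracked at a fixed frame interval, with the image y axis pointing downward, and tests whether the second differences are positive and roughly constant, as free fall would require. The function names and the tolerance are hypothetical, and this is not the protocol of any benchmark cited above.

```python
from typing import List


def second_differences(ys: List[float]) -> List[float]:
    """Discrete acceleration (up to a constant factor) at interior frames."""
    return [ys[i + 1] - 2.0 * ys[i] + ys[i - 1] for i in range(1, len(ys) - 1)]


def looks_ballistic(ys: List[float], rel_tol: float = 0.2) -> bool:
    """True if acceleration is downward and roughly constant across frames."""
    acc = second_differences(ys)
    mean_acc = sum(acc) / len(acc)
    if mean_acc <= 0.0:
        return False  # not accelerating downward at all
    return all(abs(a - mean_acc) <= rel_tol * mean_acc for a in acc)


# Free fall sampled every 0.1 s: y = 0.5 * g * t^2 (arbitrary length units).
falling = [0.5 * 9.8 * (k * 0.1) ** 2 for k in range(8)]

# Object hovers for several frames, then starts dropping: smooth frame to
# frame, but the implied acceleration switches on from nothing.
hover_then_drop = [0.0, 0.0, 0.0, 0.0, 1.0, 3.0, 6.0, 10.0]

plausible = looks_ballistic(falling)
implausible = looks_ballistic(hover_then_drop)
```

Both trajectories change gradually between frames, so only the acceleration test separates them.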

Table 5 summarizes the diagnostic relevance of representative benchmarks, datasets, and evaluators. The ratings are qualitative rule-based labels, not dataset-quality scores or model-performance scores. We assign them using three checks: whether the resource’s intended diagnostic target matches the consistency claim, whether its observation unit matches the claim, and whether it provides dedicated tasks, annotations, metrics, or human judgments for that claim. A high rating (H) is assigned when all three checks are satisfied. A medium rating (M) is assigned when the resource partially diagnoses the claim or provides adaptable evidence, but was not primarily designed for it. A low rating (L) is assigned when the resource provides only indirect evidence, or when success on it does not reliably imply the corresponding form of consistency. The supplementary material provides the detailed rating protocol and representative assignment examples.
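The rating rule described above can be written as a small decision function. The three boolean checks follow the text; the extra `partial_or_adaptable` flag is an assumption introduced here to separate M from L, since the detailed protocol is deferred to the supplementary material.

```python
def coverage_rating(target_matches: bool, unit_matches: bool,
                    has_dedicated_evidence: bool,
                    partial_or_adaptable: bool = False) -> str:
    """Return 'H', 'M', or 'L' for one resource and one consistency claim."""
    if target_matches and unit_matches and has_dedicated_evidence:
        return "H"  # all three checks satisfied
    if partial_or_adaptable:
        return "M"  # partial diagnosis or adaptable evidence
    return "L"      # only indirect evidence


# Hypothetical inputs for illustration only, not the authors' assessments.
direct = coverage_rating(True, True, True)
partial = coverage_rating(True, False, False, partial_or_adaptable=True)
indirect = coverage_rating(False, False, False)
```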

Table 5. Benchmark, dataset, and evaluator coverage for consistency in diffusion generation. P/C: prompt and compositional faithfulness; S/E: structural control and edit preservation; ID: subject/identity persistence; V/T: multi-view, temporal, or narrative coherence; N/S: preference, safety, or value consistency; P/W: physical, causal, or world-grounded plausibility. H/M/L indicate diagnostic coverage: H means direct evaluation with dedicated tasks, annotations, or metrics; M means partial support or adaptable evidence; L means only indirect relevance. The ratings describe relevance to each consistency claim rather than overall dataset quality. The code column lists the official repository when a stable public project repository is available; “–” indicates that no official GitHub repository is available.

Resource	Code	Type	Mod.	Primary	P/C	S/E	ID	V/T	N/S	P/W	Diagnostic use and blind spot
External Consistency Resources
TIFA [50]		Eval.	T2I	Ext.	H	M	L	L	L	L	QA-based prompt faithfulness; weak on identity, temporal, and world consistency.
GenEval [41]		Bench.	T2I	Ext.	H	M	L	L	L	L	Object-focused prompt benchmark; covers omission, binding, and counting, not editing or temporal consistency.
T2I-CompBench [42]		Bench.	T2I	Ext.	H	M	L	L	L	L	Broad compositional benchmark for relations and attributes; weaker on grounded control and preservation-based editing.
GenEval 2 [109]		Bench.	T2I	Ext.	H	M	L	L	L	L	Harder prompt-following resource addressing benchmark drift; still centered on image-level prompt faithfulness.
HRS-Bench [178]		Bench.	T2I	Ext.	H	M	L	L	M	L	Holistic T2I benchmark covering accuracy, robustness, generalization, fairness, and bias; useful for broad prompt-skill diagnostics but less specific to edit preservation.
DPG-Bench [179]		Bench.	T2I	Ext.	H	M	L	L	L	L	Dense-prompt benchmark for long and complex prompt following; strong for entity, attribute, and relation grounding, but not designed for editing or temporal consistency.
GenAI-Bench [180]		Bench./Eval.	T2I/T2V	Ext.	H	M	L	M	L	L	Compositional text-to-visual benchmark with VQA-style scoring; useful for external semantic agreement, but limited for identity, safety, and physical-state persistence.
EditBench [181]	–	Bench.	Edit	Ext.	M	H	L	L	L	L	Text-guided image inpainting benchmark; directly probes instruction adherence and preservation in masked editing, but less general for open-ended editing.
MagicBrush [182]		Dataset/Bench.	Edit	Ext.	M	H	L	L	L	L	Manually annotated instruction-guided editing dataset with single-turn, multi-turn, mask-provided, and mask-free settings; strong for edit consistency but not for world or temporal plausibility.
ConceptBed [130]		Bench.	T2I	Ext./Int.	M	M	M	L	L	L	Concept-learning and binding benchmark; useful bridge between prompt-level consistency and reusable subject concepts.
Internal Consistency Resources
MVG-Bench [84]		Bench.	T2I/3D	Int.	L	M	M	H	L	L	Dedicated benchmark for multi-view generation; emphasizes cross-view compatibility more than single-image realism.
MET3R [51]		Eval.	T2I/3D	Int.	L	L	M	H	L	L	Multi-view consistency metric; strong for geometry-aware agreement, weaker for narrative or preference alignment.
VBench [183]		Bench./Eval.	T2V	Int./Norm.	M	L	M	H	L	M	Comprehensive video-generation benchmark with dimensions such as subject consistency, background consistency, motion smoothness, and temporal flickering; broader than internal consistency but not designed for safety or preference alignment.
Video-Bench [184]		Bench.	T2V/I2V	Int./Norm.	M	L	M	H	M	M	Human-aligned video-generation benchmark; useful for action consistency, temporal consistency, and motion quality, but not designed specifically for identity-preserving story generation.
EvalCrafter [185]		Eval.	T2V	Int./Ext.	M	L	L	H	L	L	Unified evaluation toolkit for generated videos using visual, content, motion, and text-video alignment metrics; strong for video-level diagnostics but weaker for persistent identity, preference alignment, and causal state.
FETV [186]		Bench.	T2V	Ext./Int.	H	L	L	M	L	L	Fine-grained open-domain T2V benchmark; useful for prompt complexity, attribute control, and temporal generation quality, but less focused on long-horizon identity or narrative memory.
ViStoryBench [187]		Bench.	Story/T2I	Int.	M	L	H	H	L	L	Story-visualization benchmark focusing on character consistency, narrative coherence, and stylistic integrity across image sequences; strong for long-range internal consistency but not for physical dynamics.
MeViS [161]		Dataset	Video	Int.	L	L	L	M	L	L	Motion-expression segmentation; useful for temporal grounding diagnostics, but not a generative benchmark.
MOSE [164]		Dataset	Video	Int.	L	L	L	M	L	L	Video object segmentation resource for difficult scenes; useful for analyzing persistence under occlusion and scene change, but less direct than MeViS for motion-expression grounding.
TAO [188]		Dataset	Video	Int.	L	L	M	M	L	L	Large-scale tracking benchmark; useful for long-range identity diagnostics, but not diffusion-specific.
VSPW [189]		Dataset	Video	Int.	L	L	L	M	L	L	Video scene parsing resource; useful for scene-state continuity diagnostics rather than prompt following.
nuScenes [190]		Dataset	Video/3D	Int.	L	M	L	M	L	M	Multimodal driving dataset; useful for geometry and dynamics diagnostics, but not a standalone generative benchmark.
Normative Consistency Resources
Pick-a-Pic [93]		Dataset	T2I	Norm.	M	L	L	L	H	L	Pairwise preference data; strong for preference learning, indirect for compositional faithfulness.
ImageReward [53]		Eval.	T2I	Norm.	M	L	L	L	H	L	Learned preference reward; useful for ranking and optimization, but not for task-specific faithfulness.
HPS [90]	–	Eval.	T2I	Norm.	M	L	L	L	H	L	Human preference score; broad but coarse-grained.
HPSv2 [165]		Bench.	T2I	Norm.	M	L	L	L	H	L	Refined human-preference benchmark with stronger evaluation coverage than HPS.
HPSv3 [166]		Bench.	T2I	Norm.	M	L	L	L	H	L	Wide-spectrum human preference benchmark; still not designed for safety or world-consistency claims.
VisionReward [92]		Eval.	T2I/T2V	Norm.	M	L	L	M	H	L	Multi-dimensional image/video preference evaluator; broader than image-only rewards, but still preference-centered.
Six-CD [101]		Bench.	T2I	Norm.	L	M	L	L	H	L	Safety benchmark that jointly measures concept suppression and benign retention.
PhyBench [102]		Bench.	T2I	Norm.	L	L	L	L	L	H	Physical commonsense benchmark for text-to-image generation; useful for static world plausibility more than temporal dynamics.
VideoPhy [39]		Bench.	T2V	Norm.	L	L	L	M	L	H	Physical commonsense evaluation for generated videos.
PhyCoBench [191]		Bench.	T2V	Norm.	L	L	L	M	L	H	Optical-flow-guided physical coherence benchmark; focused on motion plausibility.
PhyGenBench [54]		Bench.	T2V	Norm.	L	L	L	M	L	H	Physical commonsense benchmark for video generation; emphasizes world simulation quality.
VideoPhy-2 [103]		Bench.	T2V	Norm.	L	L	L	M	L	H	Action-centric physical commonsense benchmark; strong for action consequences and physical interaction.
T2VPhysBench [170]	–	Bench.	T2V	Norm.	L	L	L	M	L	H	First-principles physical consistency benchmark for text-to-video.
T2VWorldBench [104]	–	Bench.	T2V	Norm.	L	L	L	M	L	H	World-knowledge benchmark covering commonsense and causal plausibility beyond local motion realism.
Physics-IQ [106]		Bench.	T2V	Norm.	L	L	L	M	L	H	Probes whether video generators internalize physical principles; diagnostic rather than end-to-end preference metric.
PhyWorldBench [105]		Bench.	T2V	Norm.	L	L	L	M	L	H	Comprehensive physical realism benchmark for text-to-video.
VideoVerse [171]		Bench.	T2V	Norm.	L	L	L	M	L	H	World-model-oriented T2V evaluation.
PhyEduVideo [192]		Dataset/Bench.	T2V	Norm.	L	L	L	M	L	H	Physics-education-oriented benchmark exposing explanatory and world-consistency gaps.

Adjacent work pushes this direction toward dynamic and simulator-like generation, including physics-based zero-shot video generation [193], fast autoregressive video diffusion and world-model variants [194], and motion-aware 3D animation pipelines. These systems move the target from single-frame plausibility toward action-conditioned state evolution over time.

These targets are heterogeneous, but they share a structural property: the generator is evaluated against a criterion not fully specified by the literal prompt. The mechanism families differ by the evidence they optimize or enforce. Preference consistency relies on reward modeling, reranking, and preference-oriented finetuning [37,53,90,92,93,166]; safety consistency uses safe guidance, concept editing, unlearning, and preservation-sensitive filtering [38,97,98,101]; and world-grounded consistency increasingly uses verifier-like or benchmark-driven loops that judge plausible dynamics and state changes rather than image realism alone [102,103,104,105,171].

Normative evaluation must remain plural. Preference metrics such as ImageReward [53] and HPSv3 [166] capture agreement with human judgments, but they do not replace faithfulness or safety evaluation. Safety benchmarks such as Six-CD [101] expose the tension between suppressing unsafe concepts and preserving benign utility. World-grounded resources such as PhyBench [102], VideoPhy-2 [103], T2VPhysBench [170], T2VWorldBench [104], and Physics-IQ [106] reveal physically or causally inconsistent futures that may still look locally realistic. The evaluator should match the normative claim being tested.

Deployment settings often combine these normative criteria. Public-facing systems need preference-aware ranking and safety moderation; creative assistants need adaptive preferences; and world-grounded video, simulation, or embodied settings require outputs whose utility is not limited to visual plausibility. These directions increasingly intersect with world-model pipelines, physics-aware video generation, and world-benchmark resources [171,193,194]. The unresolved gap is verifier-compatible normative control: current systems can score, filter, or steer outputs, but rarely expose state representations that explain why a preference, safety, or world constraint was satisfied or violated.

The ratings follow the intended diagnostic use of each resource. GenEval and GenEval 2 [41,109] are high for prompt and compositional faithfulness because they target object presence, attribute binding, counting, and spatial relations, but low for temporal or world-level claims because they observe single images. MVG-Bench [84] is high for multi-view consistency; ImageReward [53] and HPS-style evaluators [90] are high for preference-related normative consistency but only partial for prompt faithfulness. Six-CD [101] is high for safety because it jointly evaluates unsafe concept suppression and benign retention. Tracking or segmentation datasets such as TAO [188], MOSE [164], and VSPW [189] are medium for temporal or narrative consistency because they provide state-tracking annotations but are not generative benchmarks.

Table 5 shows two patterns. First, coverage is uneven: external resources are strongest on prompt and compositional faithfulness, internal resources on view or sequence compatibility, and normative resources on preference, safety, and world-plausibility claims. Second, no benchmark family jointly covers prompt faithfulness, structural preservation, identity persistence, temporal coherence, and physical, world-level, or policy consistency. This fragmentation explains why cross-relation claims require complementary protocols.
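The coverage gap can be checked mechanically. The sketch below copies a handful of rows from Table 5 into a dictionary and reports which axes a chosen set of resources rates H. The names `COVERAGE`, `high_axes`, `jointly_covered`, and `uncovered` are illustrative helpers introduced here, not part of any benchmark's tooling, and the result only speaks for the transcribed subset.

```python
AXES = ("P/C", "S/E", "ID", "V/T", "N/S", "P/W")

# Ratings copied from a subset of Table 5 rows, one letter per axis in AXES order.
COVERAGE = {
    "GenEval": "HMLLLL",
    "MagicBrush": "MHLLLL",
    "MVG-Bench": "LMMHLL",
    "ViStoryBench": "MLHHLL",
    "ImageReward": "MLLLHL",
    "Six-CD": "LMLLHL",
    "VideoPhy-2": "LLLMLH",
}


def high_axes(resource):
    """Axes on which a resource is rated H (direct evaluation)."""
    return {axis for axis, rating in zip(AXES, COVERAGE[resource]) if rating == "H"}


def jointly_covered(resources):
    """Union of H-rated axes over a collection of resources."""
    covered = set()
    for resource in resources:
        covered |= high_axes(resource)
    return covered


def uncovered(resources):
    """Axes that none of the given resources rates H, in Table 5 column order."""
    covered = jointly_covered(resources)
    return [axis for axis in AXES if axis not in covered]


if __name__ == "__main__":
    for name in COVERAGE:
        print(name, sorted(high_axes(name)))
    # A prompt benchmark plus a preference evaluator still leaves four axes open.
    print(uncovered(["GenEval", "ImageReward"]))
```

In this subset no single resource is rated H on more than two axes, so a claim that spans several relations needs a protocol assembled from several rows.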

6. Open Problems and Future Directions

The taxonomy also helps explain where current work remains limited. Relation overlap calls for conflict-aware multi-target optimization; different observation units and evidence sources call for decomposed evaluation; internal consistency exposes the need for persistent shared state; world-centered normative consistency requires verifier-compatible physical and causal representations; and the diversity of optimization loci motivates composable mechanisms. The subsections below use these links to organize the research agenda.

6.1. Conflict-Aware Multi-Target Optimization

The first challenge is relation overlap: the same output may need to satisfy external, internal, and normative consistency at once. Most methods still optimize one target at a time, whereas practical systems must combine prompt faithfulness, identity preservation, temporal coherence, safety, preference, and physical plausibility within the same generation process and user interaction.

Conflict-aware optimization requires more than independent losses or guidance modules. Objectives can interfere: identity preservation may reduce editability [34,74], safety constraints may suppress benign adjacent concepts [38,195,196], preference optimization may weaken prompt faithfulness [37,50], and temporal smoothing may reduce motion diversity [36,153]. A practical system therefore needs mechanisms to represent priorities among consistency targets, detect conflicts during sampling or editing, and decide which constraint should dominate in context. The open questions are how to realize each of these steps and how to expose the resulting decisions to users or downstream systems.
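One way to make priority representation and conflict detection concrete is to compare the update directions that different objectives propose. The sketch below is a toy under stated assumptions: each consistency target is assumed to supply a guidance vector, a conflict is flagged when a lower-priority vector has a negative dot product with a higher-priority one, and the opposing component is then projected out. The projection step is borrowed from gradient-projection ideas in multi-task learning; it is not a method taken from the works cited above, and `resolve` is a hypothetical helper.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def resolve(directions, priority):
    """Combine per-objective directions while respecting a priority order.

    directions: dict mapping an objective name to its guidance vector.
    priority:   list of objective names, highest priority first.
    Returns (combined_vector, conflicts), where conflicts lists
    (higher_priority, lower_priority) pairs whose directions opposed each other.
    """
    kept = []       # (name, possibly projected vector), in priority order
    conflicts = []
    for name in priority:
        g = list(directions[name])
        for other_name, h in kept:
            d = dot(g, h)
            if d < 0:  # g pushes against a higher-priority direction
                conflicts.append((other_name, name))
                scale = d / dot(h, h)
                # Remove the component of g that opposes h.
                g = [gi - scale * hi for gi, hi in zip(g, h)]
        kept.append((name, g))
    size = len(kept[0][1])
    combined = [sum(vec[i] for _, vec in kept) for i in range(size)]
    return combined, conflicts


# Toy vectors: the preference direction partly undoes the safety direction.
combined, conflicts = resolve(
    {"safety": [1.0, 0.0], "preference": [-1.0, 1.0]},
    priority=["safety", "preference"],
)
```

Here the safety direction is left untouched, the preference direction keeps only its non-opposing component, and the recorded conflict list is the kind of decision trace that could be surfaced to a user.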

6.2. Decomposed and Interpretable Evaluation

The evaluation gap comes from mismatched observation units and evidence sources. Prompt benchmarks mainly test single-image semantic faithfulness [41,42,50,109]; multi-view resources expose cross-view compatibility [51,84]; preference evaluators capture human-centered normative signals [53,90]; and physical benchmarks target world-centered failures [103,104,105]. A single aggregate score can hide which relation improved and which failure mode remains unresolved.

Metric design should support both comparison and diagnosis. A universal scalar metric would obscure important failure modes; more useful reporting would combine decomposed sub-scores, verifier-style audits, counterfactual stress tests, and set-level diagnostics. Reporting consistency as a structured profile would make gains and regressions across relations easier to see. Open questions remain: how to standardize such profiles, separate genuine gains from easier test cases, and audit regressions across relations.
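A structured profile can be as simple as a record with one field per consistency axis and a comparison that reports gains and regressions separately. The sketch below is a minimal illustration: the field names loosely follow the Table 5 columns, the scores are placeholder numbers rather than measured results, and `ConsistencyProfile` and `compare` are hypothetical names.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ConsistencyProfile:
    """One score per consistency axis, each assumed to lie in [0, 1]."""
    prompt_faithfulness: float
    edit_preservation: float
    identity: float
    temporal: float
    preference_safety: float
    physical: float


def compare(baseline, candidate, tol=0.01):
    """Split per-axis differences into gains, regressions, and unchanged axes."""
    report = {"gains": {}, "regressions": {}, "unchanged": []}
    for f in fields(ConsistencyProfile):
        delta = getattr(candidate, f.name) - getattr(baseline, f.name)
        if delta > tol:
            report["gains"][f.name] = round(delta, 4)
        elif delta < -tol:
            report["regressions"][f.name] = round(delta, 4)
        else:
            report["unchanged"].append(f.name)
    return report


def mean_score(profile):
    """The single aggregate that a scalar leaderboard would report."""
    values = [getattr(profile, f.name) for f in fields(ConsistencyProfile)]
    return sum(values) / len(values)


# Placeholder numbers chosen only to illustrate the reporting format.
baseline = ConsistencyProfile(0.70, 0.60, 0.80, 0.75, 0.65, 0.50)
candidate = ConsistencyProfile(0.78, 0.60, 0.71, 0.75, 0.70, 0.50)
report = compare(baseline, candidate)
```

With these placeholder values the aggregate mean rises while the identity axis regresses, which is the situation a single scalar would hide.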

6.3. Persistent Shared State and User Memory

Internal consistency also exposes a memory bottleneck. Current methods preserve subject identity, geometry, motion, layout, or story state [34,36,46,47,52,74,80,153,197], but these states are often local, task-specific, or short-lived. Story generation, long-video synthesis, recurring personalization, and user-adaptive editing require persistent state across prompts, scenes, and user sessions [47,89]. Even recent long-range character or story systems [47,152,198] still drift as identities gradually change, scene states are overwritten, or user preferences are not reliably retained across sessions.

The main challenge is selective memory, not memory size alone. A useful mechanism must retain task-relevant identity, scene, preference, and narrative state while allowing local edits, new events, and style variation. This requires separating long-lived global memory from short-lived local constraints, as suggested by hierarchical and memory-conditioned designs [46,47]. Key questions include what should be stored, when stale state should be overwritten, and how memory use should remain controllable when prompts introduce new identities, scenes, or preferences.
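The separation between long-lived global memory and short-lived local constraints can be sketched as a two-tier store. The following is a toy illustration, not the design of any cited system: global entries persist until they are explicitly overwritten, local entries expire after a fixed number of generation steps, and local entries take precedence when both tiers hold the same key. `ConsistencyMemory` and its methods are hypothetical names.

```python
class ConsistencyMemory:
    """Two-tier store: long-lived global state plus short-lived local constraints."""

    def __init__(self):
        self.step = 0
        self._global = {}  # key -> value, kept until explicitly overwritten
        self._local = {}   # key -> (value, step at which the entry expires)

    def remember(self, key, value, ttl=None, overwrite=False):
        """Store a global entry (ttl=None) or a local one visible for ttl steps."""
        if ttl is None:
            if key in self._global and not overwrite:
                raise ValueError(f"{key!r} is already stored; pass overwrite=True")
            self._global[key] = value
        else:
            # Visible for `ttl` steps, counting the current one.
            self._local[key] = (value, self.step + ttl)

    def advance(self):
        """Move to the next generation step and drop expired local entries."""
        self.step += 1
        self._local = {
            k: (v, expiry) for k, (v, expiry) in self._local.items() if expiry > self.step
        }

    def context(self):
        """State to condition on at this step; local entries override global ones."""
        merged = dict(self._global)
        merged.update({k: v for k, (v, _) in self._local.items()})
        return merged


memory = ConsistencyMemory()
memory.remember("hero", "red scarf")           # identity: long-lived
memory.remember("lighting", "night", ttl=2)    # scene constraint: short-lived
```

Requiring `overwrite=True` for global entries is one possible answer to the question of when stale state may be replaced: the caller has to make the decision explicit instead of letting a new prompt silently overwrite a stored identity.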

6.4. World-Grounded Physical and Causal Consistency

World-centered normative consistency raises a different problem. As diffusion models move toward video, simulation, and embodied world modeling [14,45,199,200], a generated video may be prompt-relevant and temporally smooth while still violating dynamics, contact logic, object permanence, or action consequences [39,54]. Benchmarks expose these failures, but most generators still lack explicit state variables or verifier-compatible physical and causal representations. The gap matters especially when generated trajectories are used for planning, control, or downstream reasoning [14,200].

Progress requires tighter connections between generative models and world priors, whether through geometry, dynamics, action-conditioned transitions, simulator constraints, verifier loops [35,80], or large-scale generative pretraining [14,199]. The unresolved problem is how to preserve geometry, affordances, and causal state transitions rather than only framewise visual realism. Key questions include which state variables should be made explicit, how verifiers should interact with sampling, and how physical or causal consistency should be evaluated beyond short clips [39,54,200].
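The simplest form of verifier interaction is post hoc rejection: draw a candidate, check it against an explicit constraint, and resample on failure. The sketch below illustrates only that loop. The sampler is a random stand-in for a generator, the "video" is reduced to the per-frame height of a dropped object, and the verifier encodes a single toy rule (height never increases and never goes below the ground). All names are hypothetical, and nothing here guides the denoising process itself.

```python
import random


def toy_sampler(rng, steps=5):
    """Stand-in for a generator: per-frame heights of a dropped object."""
    height = 10.0
    trajectory = [height]
    for _ in range(steps):
        height += rng.uniform(-3.0, 1.0)  # mostly falls, sometimes jumps upward
        trajectory.append(height)
    return trajectory


def falls_plausibly(trajectory):
    """Toy verifier: height is non-increasing and never below the ground."""
    non_increasing = all(b <= a for a, b in zip(trajectory, trajectory[1:]))
    return non_increasing and min(trajectory) >= 0.0


def sample_with_verifier(sampler, verifier, rng, max_tries=200):
    """Resample until the verifier accepts or the attempt budget runs out.

    Returns (candidate, attempts); candidate is None if every attempt failed.
    """
    for attempt in range(1, max_tries + 1):
        candidate = sampler(rng)
        if verifier(candidate):
            return candidate, attempt
    return None, max_tries


if __name__ == "__main__":
    trajectory, attempts = sample_with_verifier(
        toy_sampler, falls_plausibly, random.Random(0)
    )
    print(attempts, trajectory)
```

Because the verifier reads an explicit state variable (height), a rejection can be explained by pointing at the frame where the rule fails. That is the property the paragraph above asks of verifier-compatible representations, and it is what pixel-level realism scores do not provide.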

6.5. Composable Mechanisms

The optimization-locus axis also reveals a composability gap. Control branches, identity adapters, temporal modules, concept editing, safe guidance, and preference optimization each improve a narrow consistency target at a particular locus [32,34,36,38,48,53,59,74,93,99,153]. These mechanisms are not compositional by default: subject-specific adaptation may not transfer across users [34,52,74], long-context or video-level methods are computationally expensive [45,199], and safety or preference modules are often target-specific [37,38,99].

A reusable consistency mechanism should be attachable, composable, and replaceable according to the required relation. Such modules would preserve pretrained diffusion backbones while adapting identity, geometry, motion, preference, or safety behavior without full retraining [32,63,74]. The open questions are how modules should declare the relation they enforce, how conflicts among modules should be explicitly mediated, and how composition affects efficiency, reliability, and user control.
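One way to read "declare the relation they enforce" is as a small piece of metadata attached to each module, which a composer can inspect before anything is applied. The sketch below is a hypothetical interface, not an existing library API: every module names its relation and its optimization locus, and the composer orders modules by priority and reports the loci that more than one module wants to modify.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConsistencyModule:
    """Declarative description of a pluggable consistency mechanism."""
    name: str
    relation: str   # "external", "internal", or "normative"
    locus: str      # where it acts, e.g. "attention", "adapter", "guidance"
    priority: int   # larger values are applied first and win on contested loci


def compose(modules):
    """Order modules by priority and report loci claimed by several modules.

    Returns (stack, contested): stack is the application order, and contested
    maps each shared locus to its claimants, highest priority first.
    """
    by_locus = {}
    for module in modules:
        by_locus.setdefault(module.locus, []).append(module)
    contested = {
        locus: sorted(claimants, key=lambda m: -m.priority)
        for locus, claimants in by_locus.items()
        if len(claimants) > 1
    }
    stack = sorted(modules, key=lambda m: (-m.priority, m.name))
    return stack, contested


# Illustrative module descriptions; names and priorities are made up.
modules = [
    ConsistencyModule("layout_control", "external", "adapter", priority=2),
    ConsistencyModule("identity_adapter", "internal", "adapter", priority=1),
    ConsistencyModule("safe_guidance", "normative", "guidance", priority=3),
]
stack, contested = compose(modules)
```

Reporting contested loci before sampling starts keeps the mediation explicit: in this example the two adapter-level modules are flagged as competing, and the declared priorities decide which one dominates.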

7. Conclusion

As diffusion models move from single-image synthesis toward editing, personalization, video, multi-view generation, 3D content creation, and world-grounded simulation, visual realism alone is no longer sufficient. Generated content must follow external conditions, remain coherent across related samples, views, and temporal states, and satisfy human-centered and world-centered evaluative constraints.

We organized diffusion-based visual generation around three consistency relations. External consistency captures agreement with prompts, controls, references, and editing instructions; internal consistency captures shared-state preservation across identities, views, frames, scenes, and narratives; and normative consistency captures agreement with preference, safety, physical plausibility, commonsense, and causal structure. Four auxiliary axes (observation unit, agreement target, optimization locus, and evidence source) connect these relations to mechanisms, metrics, benchmarks, failure modes, and trade-offs.

The relation-centered view explains why consistency should be reported as a decomposed profile rather than a single quality score: gains in prompt following, identity preservation, safety, preference, temporal stability, or physical plausibility can conflict. Future progress will depend on stronger modules for individual relations and on making cross-relation conflicts observable, evaluable, and controllable across structured creation, long-form generation, and world-grounded simulation.

Supplementary Materials

The following supporting information can be downloaded at the website of this paper posted on Preprints.org.

References

Ho, J.; Jain, A.; Abbeel, P. Denoising Diffusion Probabilistic Models. In Proceedings of the NeurIPS; Larochelle, H.; Ranzato, M.; Hadsell, R.; Balcan, M.; Lin, H., Eds., 2020.
Song, J.; Meng, C.; Ermon, S. Denoising Diffusion Implicit Models. In Proceedings of the ICLR. OpenReview.net, 2021.
Song, Y.; Sohl-Dickstein, J.; Kingma, D.P.; Kumar, A.; Ermon, S.; Poole, B. Score-Based Generative Modeling through Stochastic Differential Equations. In Proceedings of the ICLR. OpenReview.net, 2021.
Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-Resolution Image Synthesis with Latent Diffusion Models. In Proceedings of the CVPR. IEEE, 2022, pp. 10674–10685. [CrossRef]
Podell, D.; English, Z.; Lacey, K.; Blattmann, A.; Dockhorn, T.; Müller, J.; Penna, J.; Rombach, R. SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis. In Proceedings of the ICLR. OpenReview.net, 2024.
Black Forest Labs; Batifol, S.; Blattmann, A.; Boesel, F.; Consul, S.; Diagne, C.; Dockhorn, T.; English, J.; English, Z.; Esser, P.; et al. FLUX.1 Kontext: Flow Matching for In-Context Image Generation and Editing in Latent Space. CoRR 2025, abs/2506.15742, [2506.15742]. [CrossRef]
Wu, C.; Li, J.; Zhou, J.; Lin, J.; Gao, K.; Yan, K.; Yin, S.; Bai, S.; Xu, X.; Chen, Y.; et al. Qwen-Image Technical Report. CoRR 2025, abs/2508.02324, [2508.02324]. [CrossRef]
Huang, Y.; Huang, J.; Liu, Y.; Yan, M.; Lv, J.; Liu, J.; Xiong, W.; Zhang, H.; Cao, L.; Chen, S. Diffusion Model-Based Image Editing: A Survey. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 47, 4409–4437. [CrossRef]
Zhang, X.; Wei, X.; Hu, W.; Wu, J.; Wu, J.; Zhang, W.; Zhang, Z.; Lei, Z.; Li, Q. A Survey on Personalized Content Synthesis with Diffusion Models. Mach. Intell. Res. 2025, 22, 817–848. [CrossRef]
Gao, Y.; Guo, H.; Hoang, T.; Huang, W.; Jiang, L.; Kong, F.; Li, H.; Li, J.; Li, L.; Li, X.; et al. Seedance 1.0: Exploring the Boundaries of Video Generation Models. CoRR 2025, abs/2506.09113, [2506.09113]. [CrossRef]
Wang, A.; Ai, B.; Wen, B.; Mao, C.; Xie, C.; Chen, D.; Yu, F.; Zhao, H.; Yang, J.; Zeng, J.; et al. Wan: Open and Advanced Large-Scale Video Generative Models. CoRR 2025, abs/2503.20314, [2503.20314]. [CrossRef]
Xiang, J.; Lv, Z.; Xu, S.; Deng, Y.; Wang, R.; Zhang, B.; Chen, D.; Tong, X.; Yang, J. Structured 3D Latents for Scalable and Versatile 3D Generation. In Proceedings of the CVPR. Computer Vision Foundation / IEEE, 2025, pp. 21469–21480. [CrossRef]
Zhao, Z.; Lai, Z.; Lin, Q.; Zhao, Y.; Liu, H.; Yang, S.; Feng, Y.; Yang, M.; Zhang, S.; Yang, X.; et al. Hunyuan3D 2.0: Scaling Diffusion Models for High Resolution Textured 3D Assets Generation. CoRR 2025, abs/2501.12202, [2501.12202]. [CrossRef]
Bruce, J.; Dennis, M.D.; Edwards, A.; Parker-Holder, J.; Shi, Y.; Hughes, E.; Lai, M.; Mavalankar, A.; Steigerwald, R.; Apps, C.; et al. Genie: Generative Interactive Environments. In Proceedings of the ICML; Salakhutdinov, R.; Kolter, Z.; Heller, K.A.; Weller, A.; Oliver, N.; Scarlett, J.; Berkenkamp, F., Eds. PMLR / OpenReview.net, 2024, Proceedings of Machine Learning Research, pp. 4603–4623.
HunyuanWorld Team; Wang, Z.; Liu, Y.; Wu, J.; Gu, Z.; Wang, H.; Zuo, X.; Huang, T.; Li, W.; Zhang, S.; et al. HunyuanWorld 1.0: Generating Immersive, Explorable, and Interactive 3D Worlds from Words or Pixels. CoRR 2025, abs/2507.21809, [2507.21809]. [CrossRef]
Zhang, C.; Zhang, C.; Zhang, M.; Kweon, I.S. Text-to-image Diffusion Models in Generative AI: A Survey. CoRR 2023, abs/2303.07909, [2303.07909]. [CrossRef]
Cao, P.; Zhou, F.; Song, Q.; Yang, L. Controllable Generation With Text-to-Image Diffusion Models: A Survey. IEEE Trans. Pattern Anal. Mach. Intell. 2026, 48, 4771–4791. [CrossRef]
Wang, Y.; Liu, X.; Pang, W.; Ma, L.; Yuan, S.; Debevec, P.E.; Yu, N. Survey of Video Diffusion Models: Foundations, Implementations, and Applications. Trans. Mach. Learn. Res. 2025, 2025.
Elmoghany, M.; Rossi, R.A.; Yoon, S.; Mukherjee, S.; Bakr, E.M.; Mathur, P.; Wu, G.; Lai, V.D.; Lipka, N.; Zhang, R.; et al. A Survey on Long-Video Storytelling Generation: Architectures, Consistency, and Cinematic Quality. In Proceedings of the ICCV Workshops. IEEE, 2025, pp. 7082–7094. [CrossRef]
Liu, B.; Shao, S.; Li, B.; Bai, L.; Xu, Z.; Xiong, H.; Kwok, J.T.; Helal, S.; Xie, Z. Alignment of Diffusion Models: Fundamentals, Challenges, and Future. ACM Comput. Surv. 2026, 58, 244:1–244:37. [CrossRef]
Wang, Z.; Li, D.; Jiang, R. Diffusion Models in 3D Vision: A Survey. CoRR 2024, abs/2410.04738, [2410.04738]. [CrossRef]
Miao, Q.; Li, K.; Quan, J.; Min, Z.; Ma, S.; Xu, Y.; Yang, Y.; Luo, Y. Advances in 4D Generation: A Survey. CoRR 2025, abs/2503.14501, [2503.14501]. [CrossRef]
Liu, D.; Zhang, J.; Dinh, A.; Park, E.; Zhang, S.; Xu, C. Generative Physical AI in Vision: A Survey. CoRR 2025, abs/2501.10928, [2501.10928]. [CrossRef]
Hartwig, S.; Engel, D.; Sick, L.; Kniesel, H.; Payer, T.; Poonam, P.; Glöckler, M.; Bäuerle, A.; Ropinski, T. A Survey on Quality Metrics for Text-to-Image Generation. TVCG 2025, 31, 9464–9483. [CrossRef]
Xu, Y.; Zhang, J.; Salemi, A.; Hu, X.; Wang, W.; Feng, F.; Zamani, H.; He, X.; Chua, T. Personalized Generation In Large Model Era: A Survey. In Proceedings of the ACL; Che, W.; Nabende, J.; Shutova, E.; Pilehvar, M.T., Eds. Association for Computational Linguistics, 2025, pp. 24607–24649.
Wei, Y.; Zheng, Y.; Zhang, Y.; Liu, M.; Ji, Z.; Zhang, L.; Zuo, W. Personalized Image Generation with Deep Generative Models: A Decade Survey. Comput. Vis. Media 2025, 11, 1141–1194. [CrossRef]
Xing, Z.; Feng, Q.; Chen, H.; Dai, Q.; Hu, H.; Xu, H.; Wu, Z.; Jiang, Y. A Survey on Video Diffusion Models. ACM Comput. Surv. 2025, 57, 41:1–41:42. [CrossRef]
Zhang, Y.; Chen, Z.; Cheng, C.; Ruan, W.; Huang, X.; Zhao, D.; Flynn, D.; Khastgir, S.; Zhao, X. Trustworthy text-to-image diffusion models: A timely and focused survey. Inf. Fusion 2026, 133, 104264. [CrossRef]
Truong, V.T.; Dang, B.L.; Le, L.B. Attacks and Defenses for Generative Diffusion Models: A Comprehensive Survey. ACM Comput. Surv. 2025, 57, 216:1–216:44. [CrossRef]
Li, X.; He, X.; Zhang, L.; Liu, Y. A Comprehensive Survey on World Models for Embodied AI. CoRR 2025, abs/2510.16732, [2510.16732]. [CrossRef]
Chefer, H.; Alaluf, Y.; Vinker, Y.; Wolf, L.; Cohen-Or, D. Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models. ACM Trans. Graph. 2023, 42, 148:1–148:10. [CrossRef]
Zhang, L.; Rao, A.; Agrawala, M. Adding Conditional Control to Text-to-Image Diffusion Models. In Proceedings of the ICCV. IEEE, 2023, pp. 3813–3824. [CrossRef]
Brooks, T.; Holynski, A.; Efros, A.A. InstructPix2Pix: Learning to Follow Image Editing Instructions. In Proceedings of the CVPR. IEEE, 2023, pp. 18392–18402. [CrossRef]
Ruiz, N.; Li, Y.; Jampani, V.; Pritch, Y.; Rubinstein, M.; Aberman, K. DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation. In Proceedings of the CVPR. IEEE, 2023, pp. 22500–22510. [CrossRef]
Shi, Y.; Wang, P.; Ye, J.; Mai, L.; Li, K.; Yang, X. MVDream: Multi-view Diffusion for 3D Generation. In Proceedings of the ICLR. OpenReview.net, 2024.
Guo, Y.; Yang, C.; Rao, A.; Liang, Z.; Wang, Y.; Qiao, Y.; Agrawala, M.; Lin, D.; Dai, B. AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning. In Proceedings of the ICLR. OpenReview.net, 2024.
Wallace, B.; Dang, M.; Rafailov, R.; Zhou, L.; Lou, A.; Purushwalkam, S.; Ermon, S.; Xiong, C.; Joty, S.; Naik, N. Diffusion Model Alignment Using Direct Preference Optimization. In Proceedings of the CVPR. IEEE, 2024, pp. 8228–8238. [CrossRef]
Schramowski, P.; Brack, M.; Deiseroth, B.; Kersting, K. Safe Latent Diffusion: Mitigating Inappropriate Degeneration in Diffusion Models. In Proceedings of the CVPR. IEEE, 2023, pp. 22522–22531. [CrossRef]
Bansal, H.; Lin, Z.; Xie, T.; Zong, Z.; Yarom, M.; Bitton, Y.; Jiang, C.; Sun, Y.; Chang, K.; Grover, A. VideoPhy: Evaluating Physical Commonsense for Video Generation. In Proceedings of the ICLR. OpenReview.net, 2025.
Song, Y.; Dhariwal, P.; Chen, M.; Sutskever, I. Consistency Models. In Proceedings of the ICML; Krause, A.; Brunskill, E.; Cho, K.; Engelhardt, B.; Sabato, S.; Scarlett, J., Eds. PMLR, 2023, Proceedings of Machine Learning Research, pp. 32211–32252.
Ghosh, D.; Hajishirzi, H.; Schmidt, L. GenEval: An object-focused framework for evaluating text-to-image alignment. In Proceedings of the NeurIPS; Oh, A.; Naumann, T.; Globerson, A.; Saenko, K.; Hardt, M.; Levine, S., Eds., 2023.
Huang, K.; Sun, K.; Xie, E.; Li, Z.; Liu, X. T2I-CompBench: A Comprehensive Benchmark for Open-world Compositional Text-to-image Generation. In Proceedings of the NeurIPS; Oh, A.; Naumann, T.; Globerson, A.; Saenko, K.; Hardt, M.; Levine, S., Eds., 2023.
Hertz, A.; Mokady, R.; Tenenbaum, J.; Aberman, K.; Pritch, Y.; Cohen-Or, D. Prompt-to-Prompt Image Editing with Cross-Attention Control. In Proceedings of the ICLR. OpenReview.net, 2023.
Kim, G.; Park, H.; Kim, T. Preserve and Personalize: Personalized Text-to-Image Diffusion Models without Distributional Drift. In Proceedings of the ICLR, 2026.
Blattmann, A.; Rombach, R.; Ling, H.; Dockhorn, T.; Kim, S.W.; Fidler, S.; Kreis, K. Align Your Latents: High-Resolution Video Synthesis with Latent Diffusion Models. In Proceedings of the CVPR. IEEE, 2023, pp. 22563–22575. [CrossRef]
Rahman, T.; Lee, H.; Ren, J.; Tulyakov, S.; Mahajan, S.; Sigal, L. Make-A-Story: Visual Memory Conditioned Consistent Story Generation. In Proceedings of the CVPR. IEEE, 2023, pp. 2493–2502.
Zhou, Y.; Zhou, D.; Cheng, M.; Feng, J.; Hou, Q. StoryDiffusion: Consistent Self-Attention for Long-Range Image and Video Generation. In Proceedings of the NeurIPS; Globerson, A.; Mackey, L.; Belgrave, D.; Fan, A.; Paquet, U.; Tomczak, J.M.; Zhang, C., Eds., 2024.
Li, Y.; Liu, H.; Wu, Q.; Mu, F.; Yang, J.; Gao, J.; Li, C.; Lee, Y.J. GLIGEN: Open-Set Grounded Text-to-Image Generation. In Proceedings of the CVPR. IEEE, 2023, pp. 22511–22521. [CrossRef]
Shin, J.; Li, Z.; Zhang, R.; Zhu, J.; Park, J.; Shechtman, E.; Huang, X. MotionStream: Real-Time Video Generation with Interactive Motion Controls. CoRR 2025, abs/2511.01266, [2511.01266]. [CrossRef]
Hu, Y.; Liu, B.; Kasai, J.; Wang, Y.; Ostendorf, M.; Krishna, R.; Smith, N.A. TIFA: Accurate and Interpretable Text-to-Image Faithfulness Evaluation with Question Answering. In Proceedings of the ICCV. IEEE, 2023, pp. 20349–20360. [CrossRef]
Asim, M.; Wewer, C.; Wimmer, T.; Schiele, B.; Lenssen, J.E. MET3R: Measuring Multi-View Consistency in Generated Images. In Proceedings of the CVPR. Computer Vision Foundation / IEEE, 2025, pp. 6034–6044. [CrossRef]
Li, Z.; Cao, M.; Wang, X.; Qi, Z.; Cheng, M.; Shan, Y. PhotoMaker: Customizing Realistic Human Photos via Stacked ID Embedding. In Proceedings of the CVPR. IEEE, 2024, pp. 8640–8650. [CrossRef]
Xu, J.; Liu, X.; Wu, Y.; Tong, Y.; Li, Q.; Ding, M.; Tang, J.; Dong, Y. ImageReward: Learning and Evaluating Human Preferences for Text-to-Image Generation. In Proceedings of the NeurIPS; Oh, A.; Naumann, T.; Globerson, A.; Saenko, K.; Hardt, M.; Levine, S., Eds., 2023.
Meng, F.; Liao, J.; Tan, X.; Lu, Q.; Shao, W.; Zhang, K.; Cheng, Y.; Li, D.; Luo, P. Towards World Simulator: Crafting Physical Commonsense-Based Benchmark for Video Generation. In Proceedings of the ICML; Singh, A.; Fazel, M.; Hsu, D.; Lacoste-Julien, S.; Berkenkamp, F.; Maharaj, T.; Wagstaff, K.; Zhu, J., Eds. PMLR / OpenReview.net, 2025, Proceedings of Machine Learning Research.
Lin, H.; Cho, J.; Zala, A.; Bansal, M. Ctrl-Adapter: An Efficient and Versatile Framework for Adapting Diverse Controls to Any Diffusion Model. In Proceedings of the ICLR. OpenReview.net, 2025.
Guo, X.; Ma, X.; Zhang, H.; Huang, D. CTCal: Rethinking Text-to-Image Diffusion Models via Cross-Timestep Self-Calibration. CoRR 2026, abs/2603.20741, [2603.20741]. [CrossRef]
Lee, K.; Li, X.; Wang, Q.; He, J.; Ke, J.; Yang, M.; Essa, I.; Shin, J.; Yang, F.; Li, Y. Calibrated Multi-Preference Optimization for Aligning Diffusion Models. In Proceedings of the CVPR. Computer Vision Foundation / IEEE, 2025, pp. 18465–18475. [CrossRef]
Xie, J.; Li, Y.; Huang, Y.; Liu, H.; Zhang, W.; Zheng, Y.; Shou, M.Z. BoxDiff: Text-to-Image Synthesis with Training-Free Box-Constrained Diffusion. In Proceedings of the ICCV. IEEE, 2023, pp. 7418–7427. [CrossRef]
Huang, L.; Chen, D.; Liu, Y.; Shen, Y.; Zhao, D.; Zhou, J. Composer: Creative and Controllable Image Synthesis with Composable Conditions. In Proceedings of the ICML; Krause, A.; Brunskill, E.; Cho, K.; Engelhardt, B.; Sabato, S.; Scarlett, J., Eds. PMLR, 2023, Proceedings of Machine Learning Research, pp. 13753–13773.
Binyamin, L.; Tewel, Y.; Segev, H.; Hirsch, E.; Rassin, R.; Chechik, G. Make It Count: Text-to-Image Generation with an Accurate Number of Objects. In Proceedings of the CVPR. Computer Vision Foundation / IEEE, 2025, pp. 13242–13251. [CrossRef]
Li, Y.; Wan, P.; Han, L.; Wang, Y.; Nie, L.; Zhang, M. CountDiffusion: Text-to-Image Synthesis with Training-Free Counting-Guidance Diffusion. CoRR 2025, abs/2505.04347, [2505.04347]. [CrossRef]
Zeng, G.; Zhang, X.; Wang, Z.; Xu, H.; Chen, Z.; Li, B.; Tu, Z. YOLO-Count: Differentiable Object Counting for Text-to-Image Generation. CoRR 2025, abs/2508.00728, [2508.00728]. [CrossRef]
Mou, C.; Wang, X.; Xie, L.; Wu, Y.; Zhang, J.; Qi, Z.; Shan, Y. T2I-Adapter: Learning Adapters to Dig Out More Controllable Ability for Text-to-Image Diffusion Models. In Proceedings of the AAAI; Wooldridge, M.J.; Dy, J.G.; Natarajan, S., Eds. AAAI Press, 2024, pp. 4296–4304. [CrossRef]
Ye, H.; Zhang, J.; Liu, S.; Han, X.; Yang, W. IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models. CoRR 2023, abs/2308.06721, [2308.06721]. [CrossRef]
Chen, X.; Huang, L.; Liu, Y.; Shen, Y.; Zhao, D.; Zhao, H. AnyDoor: Zero-shot Object-level Image Customization. In Proceedings of the CVPR. IEEE, 2024, pp. 6593–6602. [CrossRef]
Yu, J.; Wang, Y.; Zhao, C.; Ghanem, B.; Zhang, J. FreeDoM: Training-Free Energy-Guided Conditional Diffusion Model. In Proceedings of the ICCV. IEEE, 2023, pp. 23117–23127. [CrossRef]
Ju, X.; Zeng, A.; Zhao, C.; Wang, J.; Zhang, L.; Xu, Q. HumanSD: A Native Skeleton-Guided Diffusion Model for Human Image Generation. In Proceedings of the ICCV. IEEE, 2023, pp. 15942–15952. [CrossRef]
Joung, W.; Chae, D.; Kim, J. SemanticControl: A Training-Free Approach for Handling Loosely Aligned Visual Conditions in ControlNet. CoRR 2025, abs/2509.21938, [2509.21938]. [CrossRef]
Couairon, G.; Verbeek, J.; Schwenk, H.; Cord, M. DiffEdit: Diffusion-based semantic image editing with mask guidance. In Proceedings of the ICLR. OpenReview.net, 2023.
Geng, Z.; Yang, B.; Hang, T.; Li, C.; Gu, S.; Zhang, T.; Bao, J.; Zhang, Z.; Li, H.; Hu, H.; et al. InstructDiffusion: A Generalist Modeling Interface for Vision Tasks. In Proceedings of the CVPR. IEEE, 2024, pp. 12709–12720. [CrossRef]
Mokady, R.; Hertz, A.; Aberman, K.; Pritch, Y.; Cohen-Or, D. Null-text Inversion for Editing Real Images using Guided Diffusion Models. In Proceedings of the CVPR. IEEE, 2023, pp. 6038–6047. [CrossRef]
Kawar, B.; Zada, S.; Lang, O.; Tov, O.; Chang, H.; Dekel, T.; Mosseri, I.; Irani, M. Imagic: Text-Based Real Image Editing with Diffusion Models. In Proceedings of the CVPR. IEEE, 2023, pp. 6007–6017. [CrossRef]
Yang, B.; Gu, S.; Zhang, B.; Zhang, T.; Chen, X.; Sun, X.; Chen, D.; Wen, F. Paint by Example: Exemplar-based Image Editing with Diffusion Models. In Proceedings of the CVPR. IEEE, 2023, pp. 18381–18391. [CrossRef]
Kumari, N.; Zhang, B.; Zhang, R.; Shechtman, E.; Zhu, J. Multi-Concept Customization of Text-to-Image Diffusion. In Proceedings of the CVPR. IEEE, 2023, pp. 1931–1941. [CrossRef]
Li, D.; Li, J.; Hoi, S.C.H. BLIP-Diffusion: Pre-trained Subject Representation for Controllable Text-to-Image Generation and Editing. In Proceedings of the NeurIPS; Oh, A.; Naumann, T.; Globerson, A.; Saenko, K.; Hardt, M.; Levine, S., Eds., 2023.
Wang, Q.; Bai, X.; Wang, H.; Qin, Z.; Chen, A. InstantID: Zero-shot Identity-Preserving Generation in Seconds. CoRR 2024, abs/2401.07519, [2401.07519]. [CrossRef]
Tewel, Y.; Kaduri, O.; Gal, R.; Kasten, Y.; Wolf, L.; Chechik, G.; Atzmon, Y. Training-Free Consistent Text-to-Image Generation. ACM Trans. Graph. 2024, 43, 52:1–52:18. [CrossRef]
Liu, R.; Wu, R.; Hoorick, B.V.; Tokmakov, P.; Zakharov, S.; Vondrick, C. Zero-1-to-3: Zero-shot One Image to 3D Object. In Proceedings of the ICCV. IEEE, 2023, pp. 9264–9275. [CrossRef]
Chen, Y.; Fang, J.; Huang, Y.; Yi, T.; Zhang, X.; Xie, L.; Wang, X.; Dai, W.; Xiong, H.; Tian, Q. Cascade-Zero123: One Image to Highly Consistent 3D with Self-prompted Nearby Views. In Proceedings of the ECCV; Leonardis, A.; Ricci, E.; Roth, S.; Russakovsky, O.; Sattler, T.; Varol, G., Eds. Springer, 2024, Lecture Notes in Computer Science, pp. 311–330. [CrossRef]
Liu, Y.; Lin, C.; Zeng, Z.; Long, X.; Liu, L.; Komura, T.; Wang, W. SyncDreamer: Generating Multiview-consistent Images from a Single-view Image. In Proceedings of the ICLR. OpenReview.net, 2024.
Long, X.; Guo, Y.; Lin, C.; Liu, Y.; Dou, Z.; Liu, L.; Ma, Y.; Zhang, S.; Habermann, M.; Theobalt, C.; et al. Wonder3D: Single Image to 3D Using Cross-Domain Diffusion. In Proceedings of the CVPR. IEEE, 2024, pp. 9970–9980. [CrossRef]
Höllein, L.; Bozic, A.; Müller, N.; Novotný, D.; Tseng, H.; Richardt, C.; Zollhöfer, M.; Nießner, M. ViewDiff: 3D-Consistent Image Generation with Text-to-Image Models. In Proceedings of the CVPR. IEEE, 2024, pp. 5043–5052. [CrossRef]
Kong, X.; Liu, S.; Lyu, X.; Taher, M.; Qi, X.; Davison, A.J. EscherNet: A Generative Model for Scalable View Synthesis. In Proceedings of the CVPR. IEEE, 2024, pp. 9503–9513. [CrossRef]
Xie, X.; Zou, C.; Karumuri, M.G.; Lenssen, J.E.; Pons-Moll, G. MVGBench: Comprehensive Benchmark for Multi-view Generation Models. CoRR 2025, abs/2507.00006, [2507.00006]. [CrossRef]
Qi, C.; Cun, X.; Zhang, Y.; Lei, C.; Wang, X.; Shan, Y.; Chen, Q. FateZero: Fusing Attentions for Zero-shot Text-based Video Editing. In Proceedings of the ICCV. IEEE, 2023, pp. 15886–15896. [CrossRef]
Liu, S.; Zhang, Y.; Li, W.; Lin, Z.; Jia, J. Video-P2P: Video Editing with Cross-Attention Control. In Proceedings of the CVPR. IEEE, 2024, pp. 8599–8608. [CrossRef]
Gong, Y.; Pang, Y.; Cun, X.; Xia, M.; He, Y.; Chen, H.; Wang, L.; Zhang, Y.; Wang, X.; Shan, Y.; et al. TaleCrafter: Interactive Story Visualization with Multiple Characters. CoRR 2023, abs/2305.18247, [2305.18247]. [CrossRef]
Liu, T.; Wang, K.; Li, S.; van de Weijer, J.; Khan, F.S.; Yang, S.; Wang, Y.; Yang, J.; Cheng, M. One-Prompt-One-Story: Free-Lunch Consistent Text-to-Image Generation Using a Single Prompt. In Proceedings of the ICLR. OpenReview.net, 2025.
Zhao, C.; Liu, M.; Wang, W.; Chen, W.; Wang, F.; Chen, H.; Zhang, B.; Shen, C. MovieDreamer: Hierarchical Generation for Coherent Long Visual Sequences. In Proceedings of the ICLR. OpenReview.net, 2025.
Wu, X.; Sun, K.; Zhu, F.; Zhao, R.; Li, H. Human Preference Score: Better Aligning Text-to-image Models with Human Preference. In Proceedings of the ICCV. IEEE, 2023, pp. 2096–2105. [CrossRef]
Zhang, S.; Wang, B.; Wu, J.; Li, Y.; Gao, T.; Zhang, D.; Wang, Z. Learning Multi-Dimensional Human Preference for Text-to-Image Generation. In Proceedings of the CVPR. IEEE, 2024, pp. 8018–8027. [CrossRef]
Xu, J.; Huang, Y.; Cheng, J.; Yang, Y.; Xu, J.; Wang, Y.; Duan, W.; Yang, S.; Jin, Q.; Li, S.; et al. VisionReward: Fine-Grained Multi-Dimensional Human Preference Learning for Image and Video Generation. In Proceedings of the AAAI; Koenig, S.; Jenkins, C.; Taylor, M.E., Eds. AAAI Press, 2026, pp. 11269–11277. [CrossRef]
Kirstain, Y.; Polyak, A.; Singer, U.; Matiana, S.; Penna, J.; Levy, O. Pick-a-Pic: An Open Dataset of User Preferences for Text-to-Image Generation. In Proceedings of the NeurIPS; Oh, A.; Naumann, T.; Globerson, A.; Saenko, K.; Hardt, M.; Levine, S., Eds., 2023.
Liang, Z.; Yuan, Y.; Gu, S.; Chen, B.; Hang, T.; Cheng, M.; Li, J.; Zheng, L. Aesthetic Post-Training Diffusion Models from Generic Preferences with Step-by-step Preference Optimization. In Proceedings of the CVPR. Computer Vision Foundation / IEEE, 2025, pp. 13199–13208. [CrossRef]
Liu, J.; Liu, G.; Liang, J.; Li, Y.; Liu, J.; Wang, X.; Wan, P.; Zhang, D.; Ouyang, W. Flow-GRPO: Training Flow Matching Models via Online RL. CoRR 2025, abs/2505.05470, [2505.05470]. [CrossRef]
Zheng, K.; Chen, H.; Ye, H.; Wang, H.; Zhang, Q.; Jiang, K.; Su, H.; Ermon, S.; Zhu, J.; Liu, M. DiffusionNFT: Online Diffusion Reinforcement with Forward Process. CoRR 2025, abs/2509.16117, [2509.16117]. [CrossRef]
Gandikota, R.; Orgad, H.; Belinkov, Y.; Materzynska, J.; Bau, D. Unified Concept Editing in Diffusion Models. In Proceedings of the WACV. IEEE, 2024, pp. 5099–5108. [CrossRef]
Wang, Z.; Wei, Y.; Li, F.; Pei, R.; Xu, H.; Zuo, W. ACE: Anti-Editing Concept Erasure in Text-to-Image Models. In Proceedings of the CVPR. Computer Vision Foundation / IEEE, 2025, pp. 23505–23515. [CrossRef]
Gandikota, R.; Materzynska, J.; Fiotto-Kaufman, J.; Bau, D. Erasing Concepts from Diffusion Models. In Proceedings of the ICCV. IEEE, 2023, pp. 2426–2436. [CrossRef]
Xiong, T.; Wu, Y.; Xie, E.; Wu, Y.; Li, Z.; Liu, X. Editing Massive Concepts in Text-to-Image Diffusion Models. CoRR 2024, abs/2403.13807, [2403.13807]. [CrossRef]
Ren, J.; Chen, K.; Cui, Y.; Zeng, S.; Liu, H.; Xing, Y.; Tang, J.; Lyu, L. Six-CD: Benchmarking Concept Removals for Benign Text-to-image Diffusion Models. CoRR 2024, abs/2406.14855, [2406.14855]. [CrossRef]
Meng, F.; Shao, W.; Luo, L.; Wang, Y.; Chen, Y.; Lu, Q.; Yang, Y.; Yang, T.; Zhang, K.; Qiao, Y.; et al. PhyBench: A Physical Commonsense Benchmark for Evaluating Text-to-Image Models. CoRR 2024, abs/2406.11802, [2406.11802]. [CrossRef]
Bansal, H.; Peng, C.; Bitton, Y.; Goldenberg, R.; Grover, A.; Chang, K. VideoPhy-2: A Challenging Action-Centric Physical Commonsense Evaluation in Video Generation. CoRR 2025, abs/2503.06800, [2503.06800]. [CrossRef]
Chen, Y.; Guo, X.; Shi, Z.; Song, Z.; Zhang, J. T2VWorldBench: A Benchmark for Evaluating World Knowledge in Text-to-Video Generation. CoRR 2025, abs/2507.18107, [2507.18107]. [CrossRef]
Gu, J.; Liu, X.; Zeng, Y.; Nagarajan, A.; Zhu, F.; Hong, D.; Fan, Y.; Yan, Q.; Zhou, K.; Liu, M.; et al. "PhyWorldBench": A Comprehensive Evaluation of Physical Realism in Text-to-Video Models. CoRR 2025, abs/2507.13428, [2507.13428]. [CrossRef]
Motamed, S.; Culp, L.; Swersky, K.; Jaini, P.; Geirhos, R. Do generative video models learn physical principles from watching videos? CoRR 2025, abs/2501.09038, [2501.09038]. [CrossRef]
Han, X.; Zhu, B.; Hu, S.; Li, F.M.; Carrington, P.; Zimmermann, R.; Chen, J. OSCBench: Benchmarking Object State Change in Text-to-Video Generation. CoRR 2026, abs/2603.11698, [2603.11698]. [CrossRef]
Choi, Y.; Park, C.; Baek, S.J. DynASyn: Multi-Subject Personalization Enabling Dynamic Action Synthesis. In Proceedings of the AAAI; Walsh, T.; Shah, J.; Kolter, Z., Eds. AAAI Press, 2025, pp. 2564–2572. [CrossRef]
Kamath, A.; Chang, K.; Krishna, R.; Zettlemoyer, L.; Hu, Y.; Ghazvininejad, M. GenEval 2: Addressing Benchmark Drift in Text-to-Image Evaluation. CoRR 2025, abs/2512.16853, [2512.16853]. [CrossRef]
Nichol, A.Q.; Dhariwal, P.; Ramesh, A.; Shyam, P.; Mishkin, P.; McGrew, B.; Sutskever, I.; Chen, M. GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models. In Proceedings of the ICML; Chaudhuri, K.; Jegelka, S.; Song, L.; Szepesvári, C.; Niu, G.; Sabato, S., Eds. PMLR, 2022, Proceedings of Machine Learning Research, pp. 16784–16804.
Saharia, C.; Chan, W.; Saxena, S.; Li, L.; Whang, J.; Denton, E.L.; Ghasemipour, S.K.S.; Lopes, R.G.; Ayan, B.K.; Salimans, T.; et al. Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding. In Proceedings of the NeurIPS; Koyejo, S.; Mohamed, S.; Agarwal, A.; Belgrave, D.; Cho, K.; Oh, A., Eds., 2022.
Balaji, Y.; Nah, S.; Huang, X.; Vahdat, A.; Song, J.; Kreis, K.; Aittala, M.; Aila, T.; Laine, S.; Catanzaro, B.; et al. eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers. CoRR 2022, abs/2211.01324, [2211.01324]. [CrossRef]
Brack, M.; Friedrich, F.; Hintersdorf, D.; Struppek, L.; Schramowski, P.; Kersting, K. SEGA: Instructing Diffusion using Semantic Dimensions. CoRR 2023, abs/2301.12247, [2301.12247]. [CrossRef]
Lee, J.; Lee, J.; Lee, J. CountCluster: Training-Free Object Quantity Guidance with Cross-Attention Map Clustering for Text-to-Image Generation. CoRR 2025, abs/2508.10710, [2508.10710]. [CrossRef]
Wang, Q.; Deng, H.; Qi, Y.; Li, D.; Song, Y. SketchKnitter: Vectorized Sketch Generation with Diffusion Models. In Proceedings of the ICLR. OpenReview.net, 2023.
Inoue, N.; Kikuchi, K.; Simo-Serra, E.; Otani, M.; Yamaguchi, K. LayoutDM: Discrete Diffusion Model for Controllable Layout Generation. In Proceedings of the CVPR. IEEE, 2023, pp. 10167–10176. [CrossRef]
Weng, H.; Huang, D.; Qiao, Y.; Hu, Z.; Lin, C.; Zhang, T.; Chen, C.L.P. Desigen: A Pipeline for Controllable Design Template Generation. In Proceedings of the CVPR. IEEE, 2024, pp. 12721–12732. [CrossRef]
Zhang, J.; Guo, J.; Sun, S.; Lou, J.; Zhang, D. LayoutDiffusion: Improving Graphic Layout Generation by Discrete Diffusion Probabilistic Models. In Proceedings of the ICCV. IEEE, 2023, pp. 7192–7202. [CrossRef]
Hui, M.; Zhang, Z.; Zhang, X.; Xie, W.; Wang, Y.; Lu, Y. Unifying Layout Generation with a Decoupled Diffusion Model. In Proceedings of the CVPR. IEEE, 2023, pp. 1942–1951. [CrossRef]
Li, L.H.; Zhang, P.; Zhang, H.; Yang, J.; Li, C.; Zhong, Y.; Wang, L.; Yuan, L.; Zhang, L.; Hwang, J.; et al. Grounded Language-Image Pre-training. In Proceedings of the CVPR. IEEE, 2022, pp. 10955–10965. [CrossRef]
Tumanyan, N.; Geyer, M.; Bagon, S.; Dekel, T. Plug-and-Play Diffusion Features for Text-Driven Image-to-Image Translation. In Proceedings of the CVPR. IEEE, 2023, pp. 1921–1930. [CrossRef]
Zhang, Z.; Han, L.; Ghosh, A.; Metaxas, D.N.; Ren, J. SINE: SINgle Image Editing with Text-to-Image Diffusion Models. In Proceedings of the CVPR. IEEE, 2023, pp. 6027–6037. [CrossRef]
Goel, V.; Peruzzo, E.; Jiang, Y.; Xu, D.; Sebe, N.; Darrell, T.; Wang, Z.; Shi, H. PAIR-Diffusion: Object-Level Image Editing with Structure-and-Appearance Paired Diffusion Models. CoRR 2023, abs/2303.17546, [2303.17546]. [CrossRef]
Huang, Y.; Xie, L.; Wang, X.; Yuan, Z.; Cun, X.; Ge, Y.; Zhou, J.; Dong, C.; Huang, R.; Zhang, R.; et al. SmartEdit: Exploring Complex Instruction-Based Image Editing with Multimodal Large Language Models. In Proceedings of the CVPR. IEEE, 2024, pp. 8362–8371. [CrossRef]
Deutch, G.; Gal, R.; Garibi, D.; Patashnik, O.; Cohen-Or, D. TurboEdit: Text-Based Image Editing Using Few-Step Diffusion Models. In Proceedings of the SIGGRAPH Asia; Igarashi, T.; Shamir, A.; Zhang, H.R., Eds. ACM, 2024, pp. 41:1–41:12. [CrossRef]
Wei, T.; Zhou, Y.; Chen, D.; Pan, X. FreeFlux: Understanding and Exploiting Layer-Specific Roles in RoPE-Based MMDiT for Versatile Image Editing. CoRR 2025, abs/2503.16153, [2503.16153]. [CrossRef]
Shi, Y.; Xue, C.; Liew, J.H.; Pan, J.; Yan, H.; Zhang, W.; Tan, V.Y.F.; Bai, S. DragDiffusion: Harnessing Diffusion Models for Interactive Point-Based Image Editing. In Proceedings of the CVPR. IEEE, 2024, pp. 8839–8849. [CrossRef]
Shi, Y.; Liew, J.H.; Yan, H.; Tan, V.Y.F.; Feng, J. InstaDrag: Lightning Fast and Accurate Drag-based Image Editing Emerging from Videos. CoRR 2024, abs/2405.13722, [2405.13722]. [CrossRef]
Liu, C.; Li, R.; Zhang, K.; Lan, Y.; Liu, D. StableV2V: Stablizing Shape Consistency in Video-to-Video Editing. CoRR 2024, abs/2411.11045, [2411.11045]. [CrossRef]
Patel, M.; Gokhale, T.; Baral, C.; Yang, Y. ConceptBed: Evaluating Concept Learning Abilities of Text-to-Image Diffusion Models. In Proceedings of the AAAI; Wooldridge, M.J.; Dy, J.G.; Natarajan, S., Eds. AAAI Press, 2024, pp. 14554–14562. [CrossRef]
Sharma, P.; Ding, N.; Goodman, S.; Soricut, R. Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning. In Proceedings of the ACL; Gurevych, I.; Miyao, Y., Eds. Association for Computational Linguistics, 2018, pp. 2556–2565. [CrossRef]
Chen, S.; Lai, J.; Gao, J.; Ye, T.; Chen, H.; Shi, H.; Shao, S.; Lin, Y.; Fei, S.; Xing, Z.; et al. PosterCraft: Rethinking High-Quality Aesthetic Poster Generation in a Unified Framework. CoRR 2025, abs/2506.10741, [2506.10741]. [CrossRef]
Zhang, Z.; Cheng, Y.; Hong, D.; Yang, M.; Shi, G.; Ma, L.; Zhang, H.; Shao, J.; Wu, X. CreatiPoster: Towards Editable and Controllable Multi-Layer Graphic Design Generation. CoRR 2025, abs/2506.10890, [2506.10890]. [CrossRef]
Gao, Y.; Lin, Z.; Liu, C.; Zhou, M.; Ge, T.; Zheng, B.; Xie, H. PosterMaker: Towards High-Quality Product Poster Generation with Accurate Text Rendering. In Proceedings of the CVPR. Computer Vision Foundation / IEEE, 2025, pp. 8083–8093. [CrossRef]
Zhu, L.; Yang, D.; Zhu, T.; Reda, F.; Chan, W.; Saharia, C.; Norouzi, M.; Kemelmacher-Shlizerman, I. TryOnDiffusion: A Tale of Two UNets. In Proceedings of the CVPR. IEEE, 2023, pp. 4606–4615. [CrossRef]
Kim, J.; Gu, G.; Park, M.; Park, S.; Choo, J. StableVITON: Learning Semantic Correspondence with Latent Diffusion Model for Virtual Try-On. CoRR 2023, abs/2312.01725, [2312.01725]. [CrossRef]
Li, X.; Sun, Q.; Zhang, P.; Ye, F.; Liao, Z.; Feng, W.; Zhao, S.; He, Q. AnyDressing: Customizable Multi-Garment Virtual Dressing via Latent Diffusion Models. In Proceedings of the CVPR. Computer Vision Foundation / IEEE, 2025, pp. 23723–23733. [CrossRef]
Gal, R.; Alaluf, Y.; Atzmon, Y.; Patashnik, O.; Bermano, A.H.; Chechik, G.; Cohen-Or, D. An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion. In Proceedings of the ICLR. OpenReview.net, 2023.
Tewel, Y.; Gal, R.; Chechik, G.; Atzmon, Y. Key-Locked Rank One Editing for Text-to-Image Personalization. In Proceedings of the SIGGRAPH; Brunvand, E.; Sheffer, A.; Wimmer, M., Eds. ACM, 2023, pp. 12:1–12:11. [CrossRef]
Huang, Z.; Wu, T.; Jiang, Y.; Chan, K.C.K.; Liu, Z. ReVersion: Diffusion-Based Relation Inversion from Images. In Proceedings of the SIGGRAPH Asia; Igarashi, T.; Shamir, A.; Zhang, H.R., Eds. ACM, 2024, pp. 4:1–4:11. [CrossRef]
Gu, J.; Wang, Y.; Zhao, N.; Fu, T.; Xiong, W.; Liu, Q.; Zhang, Z.; Zhang, H.; Zhang, J.; Jung, H.; et al. PHOTOSWAP: Personalized Subject Swapping in Images. In Proceedings of the NeurIPS; Oh, A.; Naumann, T.; Globerson, A.; Saenko, K.; Hardt, M.; Levine, S., Eds., 2023.
Cai, S.; Chan, E.R.; Zhang, Y.; Guibas, L.J.; Wu, J.; Wetzstein, G. Diffusion Self-Distillation for Zero-Shot Customized Image Generation. In Proceedings of the CVPR. Computer Vision Foundation / IEEE, 2025, pp. 18434–18443. [CrossRef]
Pan, X.; Dong, L.; Huang, S.; Peng, Z.; Chen, W.; Wei, F. Kosmos-G: Generating Images in Context with Multimodal Large Language Models. In Proceedings of the ICLR. OpenReview.net, 2024.
Mou, C.; Wu, Y.; Wu, W.; Guo, Z.; Zhang, P.; Cheng, Y.; Luo, Y.; Ding, F.; Zhang, S.; Li, X.; et al. DreamO: A Unified Framework for Image Customization. In Proceedings of the SIGGRAPH Asia; Komura, T.; Wimmer, M.; Fu, H., Eds. ACM, 2025, pp. 194:1–194:12. [CrossRef]
Zong, Z.; Jiang, D.; Ma, B.; Song, G.; Shao, H.; Shen, D.; Liu, Y.; Li, H. EasyRef: Omni-Generalized Group Image Reference for Diffusion Models via Multimodal LLM. In Proceedings of the ICML; Singh, A.; Fazel, M.; Hsu, D.; Lacoste-Julien, S.; Berkenkamp, F.; Maharaj, T.; Wagstaff, K.; Zhu, J., Eds. PMLR / OpenReview.net, 2025, Proceedings of Machine Learning Research.
He, Q.; Yao, A. Conceptrol: Concept Control of Zero-shot Personalized Image Generation. CoRR 2025, abs/2503.06568, [2503.06568]. [CrossRef]
Shi, R.; Chen, H.; Zhang, Z.; Liu, M.; Xu, C.; Wei, X.; Chen, L.; Zeng, C.; Su, H. Zero123++: a Single Image to Consistent Multi-view Diffusion Base Model. CoRR 2023, abs/2310.15110, [2310.15110]. [CrossRef]
Cheng, T.Y.; Gadelha, M.; Groueix, T.; Fisher, M.; Mech, R.; Markham, A.; Trigoni, N. Learning Continuous 3D Words for Text-to-Image Generation. In Proceedings of the CVPR. IEEE, 2024, pp. 6753–6762. [CrossRef]
Burgess, J.; Wang, K.; Yeung, S. Viewpoint Textual Inversion: Unleashing Novel View Synthesis with Pretrained 2D Diffusion Models. CoRR 2023, abs/2309.07986, [2309.07986]. [CrossRef]
Kumari, N.; Su, G.; Zhang, R.; Park, T.; Shechtman, E.; Zhu, J. Customizing Text-to-Image Diffusion with Camera Viewpoint Control. CoRR 2024, abs/2404.12333, [2404.12333]. [CrossRef]
Deitke, M.; Schwenk, D.; Salvador, J.; Weihs, L.; Michel, O.; VanderBilt, E.; Schmidt, L.; Ehsani, K.; Kembhavi, A.; Farhadi, A. Objaverse: A Universe of Annotated 3D Objects. In Proceedings of the CVPR. IEEE, 2023, pp. 13142–13153. [CrossRef]
Shen, X.; Elhoseiny, M. StoryGPT-V: Large Language Models as Consistent Story Visualizers. In Proceedings of the CVPR. Computer Vision Foundation / IEEE, 2025, pp. 13273–13283. [CrossRef]
Wu, J.Z.; Ge, Y.; Wang, X.; Lei, S.W.; Gu, Y.; Shi, Y.; Hsu, W.; Shan, Y.; Qie, X.; Shou, M.Z. Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation. In Proceedings of the ICCV. IEEE, 2023, pp. 7589–7599. [CrossRef]
Khachatryan, L.; Movsisyan, A.; Tadevosyan, V.; Henschel, R.; Wang, Z.; Navasardyan, S.; Shi, H. Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators. In Proceedings of the ICCV. IEEE, 2023, pp. 15908–15918. [CrossRef]
Wang, J.; Yuan, H.; Chen, D.; Zhang, Y.; Wang, X.; Zhang, S. ModelScope Text-to-Video Technical Report. CoRR 2023, abs/2308.06571, [2308.06571]. [CrossRef]
Singer, U.; Polyak, A.; Hayes, T.; Yin, X.; An, J.; Zhang, S.; Hu, Q.; Yang, H.; Ashual, O.; Gafni, O.; et al. Make-A-Video: Text-to-Video Generation without Text-Video Data. In Proceedings of the ICLR. OpenReview.net, 2023.
Qi, T.; Yuan, J.; Feng, W.; Fang, S.; Liu, J.; Zhou, S.; He, Q.; Xie, H.; Zhang, Y. Mask²DiT: Dual Mask-based Diffusion Transformer for Multi-Scene Long Video Generation. CoRR 2025, abs/2503.19881, [2503.19881]. [CrossRef]
Cai, M.; Cun, X.; Li, X.; Liu, W.; Zhang, Z.; Zhang, Y.; Shan, Y.; Yue, X. DiTCtrl: Exploring Attention Control in Multi-Modal Diffusion Transformer for Tuning-Free Multi-Prompt Longer Video Generation. In Proceedings of the CVPR. Computer Vision Foundation / IEEE, 2025, pp. 7763–7772. [CrossRef]
Wu, J.; Li, X.; Zeng, Y.; Zhang, J.; Zhou, Q.; Li, Y.; Tong, Y.; Chen, K. MotionBooth: Motion-Aware Customized Text-to-Video Generation. In Proceedings of the NeurIPS; Globerson, A.; Mackey, L.; Belgrave, D.; Fan, A.; Paquet, U.; Tomczak, J.M.; Zhang, C., Eds., 2024.
He, Y.; Xia, M.; Chen, H.; Cun, X.; Gong, Y.; Xing, J.; Zhang, Y.; Wang, X.; Weng, C.; Shan, Y.; et al. Animate-A-Story: Storytelling with Retrieval-Augmented Video Generation. CoRR 2023, abs/2307.06940, [2307.06940]. [CrossRef]
Ding, H.; Liu, C.; He, S.; Jiang, X.; Loy, C.C. MeViS: A Large-scale Benchmark for Video Segmentation with Motion Expressions. In Proceedings of the ICCV. IEEE, 2023, pp. 2694–2703. [CrossRef]
Zhou, X.; Koltun, V.; Krähenbühl, P. Tracking Objects as Points. In Proceedings of the ECCV; Vedaldi, A.; Bischof, H.; Brox, T.; Frahm, J., Eds. Springer, 2020, Lecture Notes in Computer Science, pp. 474–490. [CrossRef]
Li, X.; Yuan, H.; Zhang, W.; Cheng, G.; Pang, J.; Loy, C.C. Tube-Link: A Flexible Cross Tube Baseline for Universal Video Segmentation. CoRR 2023, abs/2303.12782, [2303.12782]. [CrossRef]
Ding, H.; Liu, C.; He, S.; Jiang, X.; Torr, P.H.S.; Bai, S. MOSE: A New Dataset for Video Object Segmentation in Complex Scenes. In Proceedings of the ICCV. IEEE, 2023, pp. 20167–20177. [CrossRef]
Wu, X.; Hao, Y.; Sun, K.; Chen, Y.; Zhu, F.; Zhao, R.; Li, H. Human Preference Score v2: A Solid Benchmark for Evaluating Human Preferences of Text-to-Image Synthesis. CoRR 2023, abs/2306.09341, [2306.09341]. [CrossRef]
Ma, Y.; Shui, Y.; Wu, X.; Sun, K.; Li, H. HPSv3: Towards Wide-Spectrum Human Preference Score. CoRR 2025, abs/2508.03789, [2508.03789]. [CrossRef]
Li, J.; Feng, W.; Chen, W.; Wang, W.Y. Reward Guided Latent Consistency Distillation. Trans. Mach. Learn. Res. 2024, 2024.
Luo, Y.; Hu, T.; Luo, W.; Tang, J. TDM-R1: Reinforcing Few-Step Diffusion Models with Non-Differentiable Reward. CoRR 2026, abs/2603.07700, [2603.07700]. [CrossRef]
Sabour, A.; Fidler, S.; Kreis, K. Align Your Flow: Scaling Continuous-Time Flow Map Distillation. CoRR 2025, abs/2506.14603, [2506.14603]. [CrossRef]
Guo, X.; Huo, J.; Shi, Z.; Song, Z.; Zhang, J.; Zhao, J. T2VPhysBench: A First-Principles Benchmark for Physical Consistency in Text-to-Video Generation. CoRR 2025, abs/2505.00337, [2505.00337]. [CrossRef]
Wang, Z.; Wei, X.; Li, B.; Guo, Z.; Zhang, J.; Wei, H.; Wang, K.; Zhang, L. VideoVerse: How Far is Your T2V Generator from a World Model? CoRR 2025, abs/2510.08398, [2510.08398]. [CrossRef]
Srivatsan, K.; Shamshad, F.; Naseer, M.; Nandakumar, K. STEREO: Towards Adversarially Robust Concept Erasing from Text-to-Image Generation Models. CoRR 2024, abs/2408.16807, [2408.16807]. [CrossRef]
Lu, K.; Kriplani, N.; Gandikota, R.; Pham, M.; Bau, D.; Hegde, C.; Cohen, N. When Are Concepts Erased From Diffusion Models? CoRR 2025, abs/2505.17013, [2505.17013]. [CrossRef]
Rusanovsky, M.; Malnick, S.; Jevnisek, A.; Fried, O.; Avidan, S. Memories of Forgotten Concepts. In Proceedings of the CVPR. Computer Vision Foundation / IEEE, 2025, pp. 2966–2975. [CrossRef]
Lee, U.; Kim, J.; Hwang, S. Localized Concept Erasure in Text-to-Image Diffusion Models via High-Level Representation Misdirection. CoRR 2026, abs/2602.19631, [2602.19631]. [CrossRef]
Lee, B.H.; Lim, S.; Lee, S.; Kang, D.U.; Chun, S.Y. Concept Pinpoint Eraser for Text-to-image Diffusion Models via Residual Attention Gate. In Proceedings of the ICLR. OpenReview.net, 2025.
Kim, S.; Jung, S.; Kim, B.; Choi, M.; Shin, J.; Lee, J. Safeguard Text-to-Image Diffusion Models with Human Feedback Inversion. In Proceedings of the ECCV; Leonardis, A.; Ricci, E.; Roth, S.; Russakovsky, O.; Sattler, T.; Varol, G., Eds. Springer, 2024, Lecture Notes in Computer Science, pp. 128–145. [CrossRef]
Bakr, E.M.; Sun, P.; Shen, X.; Khan, F.F.; Li, L.E.; Elhoseiny, M. HRS-Bench: Holistic, Reliable and Scalable Benchmark for Text-to-Image Models. In Proceedings of the ICCV. IEEE, 2023, pp. 19984–19996. [CrossRef]
Hu, X.; Wang, R.; Fang, Y.; Fu, B.; Cheng, P.; Yu, G. ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment. CoRR 2024, abs/2403.05135, [2403.05135]. [CrossRef]
Lin, Z.; Pathak, D.; Li, B.; Li, J.; Xia, X.; Neubig, G.; Zhang, P.; Ramanan, D. Evaluating Text-to-Visual Generation with Image-to-Text Generation. In Proceedings of the ECCV; Leonardis, A.; Ricci, E.; Roth, S.; Russakovsky, O.; Sattler, T.; Varol, G., Eds. Springer, 2024, Lecture Notes in Computer Science, pp. 366–384. [CrossRef]
Wang, S.; Saharia, C.; Montgomery, C.; Pont-Tuset, J.; Noy, S.; Pellegrini, S.; Onoe, Y.; Laszlo, S.; Fleet, D.J.; Soricut, R.; et al. Imagen Editor and EditBench: Advancing and Evaluating Text-Guided Image Inpainting. In Proceedings of the CVPR. IEEE, 2023, pp. 18359–18369. [CrossRef]
Zhang, K.; Mo, L.; Chen, W.; Sun, H.; Su, Y. MagicBrush: A Manually Annotated Dataset for Instruction-Guided Image Editing. In Proceedings of the NeurIPS; Oh, A.; Naumann, T.; Globerson, A.; Saenko, K.; Hardt, M.; Levine, S., Eds., 2023.
Huang, Z.; He, Y.; Yu, J.; Zhang, F.; Si, C.; Jiang, Y.; Zhang, Y.; Wu, T.; Jin, Q.; Chanpaisit, N.; et al. VBench: Comprehensive Benchmark Suite for Video Generative Models. In Proceedings of the CVPR. IEEE, 2024, pp. 21807–21818. [CrossRef]
Han, H.; Li, S.; Chen, J.; Yuan, Y.; Wu, Y.; Deng, Y.; Leong, C.T.; Du, H.; Fu, J.; Li, Y.; et al. Video-Bench: Human-Aligned Video Generation Benchmark. In Proceedings of the CVPR. Computer Vision Foundation / IEEE, 2025, pp. 18858–18868. [CrossRef]
Liu, Y.; Cun, X.; Liu, X.; Wang, X.; Zhang, Y.; Chen, H.; Liu, Y.; Zeng, T.; Chan, R.; Shan, Y. EvalCrafter: Benchmarking and Evaluating Large Video Generation Models. In Proceedings of the CVPR. IEEE, 2024, pp. 22139–22149. [CrossRef]
Liu, Y.; Li, L.; Ren, S.; Gao, R.; Li, S.; Chen, S.; Sun, X.; Hou, L. FETV: A Benchmark for Fine-Grained Evaluation of Open-Domain Text-to-Video Generation. In Proceedings of the NeurIPS; Oh, A.; Naumann, T.; Globerson, A.; Saenko, K.; Hardt, M.; Levine, S., Eds., 2023.
Zhuang, C.; Huang, A.; Cheng, W.; Wu, J.; Hu, Y.; Liao, J.; Huang, Z.; Wang, H.; Liao, X.; Cai, W.; et al. ViStoryBench: Comprehensive Benchmark Suite for Story Visualization. CoRR 2025, abs/2505.24862, [2505.24862]. [CrossRef]
Dave, A.; Khurana, T.; Tokmakov, P.; Schmid, C.; Ramanan, D. TAO: A Large-Scale Benchmark for Tracking Any Object. In Proceedings of the ECCV; Vedaldi, A.; Bischof, H.; Brox, T.; Frahm, J., Eds. Springer, 2020, Lecture Notes in Computer Science, pp. 436–454. [CrossRef]
Miao, J.; Wei, Y.; Wu, Y.; Liang, C.; Li, G.; Yang, Y. VSPW: A Large-scale Dataset for Video Scene Parsing in the Wild. In Proceedings of the CVPR. Computer Vision Foundation / IEEE, 2021, pp. 4133–4143. [CrossRef]
Caesar, H.; Bankiti, V.; Lang, A.H.; Vora, S.; Liong, V.E.; Xu, Q.; Krishnan, A.; Pan, Y.; Baldan, G.; Beijbom, O. nuScenes: A Multimodal Dataset for Autonomous Driving. In Proceedings of the CVPR. Computer Vision Foundation / IEEE, 2020, pp. 11618–11628. [CrossRef]
Chen, Y.; Zhu, X.; Li, T.; Chen, H.; Shen, C. A Physical Coherence Benchmark for Evaluating Video Generation Models via Optical Flow-guided Frame Prediction. CoRR 2025, abs/2502.05503, [2502.05503]. [CrossRef]
Mariam, K.M.M.; Arun, A.; Laskar, Z.; Jawahar, C.V. PhyEduVideo: A Benchmark for Evaluating Text-to-Video Models for Physics Education. CoRR 2026, abs/2601.00943, [2601.00943]. [CrossRef]
Montanaro, A.; Aira, L.S.; Aiello, E.; Valsesia, D.; Magli, E. MotionCraft: Physics-Based Zero-Shot Video Generation. In Proceedings of the NeurIPS; Globerson, A.; Mackey, L.; Belgrave, D.; Fan, A.; Paquet, U.; Tomczak, J.M.; Zhang, C., Eds., 2024.
Samuel, D.; Tzachor, I.; Levy, M.; Green, M.; Chechik, G.; Ben-Ari, R. Fast Autoregressive Video Diffusion and World Models with Temporal Cache Compression and Sparse Attention. CoRR 2026, abs/2602.01801, [2602.01801]. [CrossRef]
Gandikota, R.; Materzynska, J.; Fiotto-Kaufman, J.; Bau, D. Erasing Concepts from Diffusion Models. In Proceedings of the ICCV. IEEE, 2023, pp. 2426–2436. [CrossRef]
Kumari, N.; Zhang, B.; Wang, S.; Shechtman, E.; Zhang, R.; Zhu, J. Ablating Concepts in Text-to-Image Diffusion Models. In Proceedings of the ICCV. IEEE, 2023, pp. 22634–22645. [CrossRef]
Geyer, M.; Bar-Tal, O.; Bagon, S.; Dekel, T. TokenFlow: Consistent Diffusion Features for Consistent Video Editing. In Proceedings of the ICLR. OpenReview.net, 2024.
Avrahami, O.; Hertz, A.; Vinker, Y.; Arar, M.; Fruchter, S.; Fried, O.; Cohen-Or, D.; Lischinski, D. The Chosen One: Consistent Characters in Text-to-Image Diffusion Models. In Proceedings of the SIGGRAPH; Burbano, A.; Zorin, D.; Jarosz, W., Eds. ACM, 2024, p. 26. [CrossRef]
Chen, H.; Zhang, Y.; Cun, X.; Xia, M.; Wang, X.; Weng, C.; Shan, Y. VideoCrafter2: Overcoming Data Limitations for High-Quality Video Diffusion Models. In Proceedings of the CVPR. IEEE, 2024, pp. 7310–7320. [CrossRef]
Yang, S.; Du, Y.; Ghasemipour, S.K.S.; Tompson, J.; Kaelbling, L.P.; Schuurmans, D.; Abbeel, P. Learning Interactive Real-World Simulators. In Proceedings of the ICLR. OpenReview.net, 2024.

Figure 1. Three consistency relations in diffusion-based generation. External consistency measures agreement with prompts, layouts, references, or editing instructions. Internal consistency measures compatibility among generated parts, views, frames, or instances. Normative consistency measures agreement with human-centered criteria such as preference, safety, and value constraints, and world-centered criteria such as physical, commonsense, or causal plausibility. Typical failures are shown under each relation.

Figure 2. Evaluation protocol for consistency claims. A protocol should specify the observation unit, agreement target, evidence source, and evaluation output. Observation units range from a single image to edited pairs, identity-conditioned sets, multi-view bundles, and video or story sequences. Evidence sources include MLLM/VQA checks, similarity signals, set- or sequence-level diagnostics, learned evaluators, and human or stress-test procedures.
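
The four items named in Figure 2 can be written down as a small record, which makes it easy to check that a consistency claim leaves none of them unspecified. The following is a minimal sketch; the class name `ConsistencyProtocol`, the method `is_fully_specified`, and the example values are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ConsistencyProtocol:
    """Hypothetical record of the four items a protocol should specify."""
    observation_unit: str   # e.g. "single image", "multi-view bundle", "video sequence"
    agreement_target: str   # what the generated content must agree with
    evidence_source: str    # e.g. "MLLM/VQA check", "similarity signal", "human study"
    evaluation_output: str  # form in which the result is reported

    def is_fully_specified(self) -> bool:
        # A claim is only checkable when none of the four fields is blank.
        return all(getattr(self, f.name).strip() for f in fields(self))


# Illustrative values only; they are not taken from a specific benchmark.
example = ConsistencyProtocol(
    observation_unit="edited pair",
    agreement_target="editing instruction",
    evidence_source="MLLM/VQA check",
    evaluation_output="per-sample pass/fail",
)
print(example.is_fully_specified())
```

Recording the protocol in this form separates what is observed from what it is compared against, so two reported scores can be compared only when all four fields match.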

Figure 3. Optimization loci for enforcing consistency. Consistency can be strengthened before sampling, at the condition interface, during denoising, across coupled outputs, or after generation. These loci expose trade-offs among persistence, controllability, realism, diversity, memory cost, and modularity.

Figure 4. Survey taxonomy. The first-level partition follows the three consistency relations: external, internal, and normative consistency. The second-level nodes are aligned with the corresponding subsections in the main text and are hyperlinked to their discussions. Leaf nodes summarize representative methods, benchmarks, and applications. The figure shows primary placement; secondary relations are recorded in the corresponding text when overlap changes mechanisms, evaluation, or trade-offs.

Figure 5. Attend-and-Excite [31] as token-level prompt repair. Cross-attention feedback identifies neglected subject tokens and updates the latent state during denoising.
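
The feedback signal in Figure 5 can be illustrated with a short numerical sketch. As we understand the method in [31], each subject token is scored by the peak of its cross-attention map, and the loss is set by the most neglected token. The sketch below assumes the attention maps are already extracted and normalized to [0, 1]; it omits the Gaussian smoothing of the maps and the gradient step on the latent, which would require an autodiff framework. The function name and array layout are assumptions made for this example.

```python
from typing import Sequence

import numpy as np


def attend_and_excite_loss(attn: np.ndarray, subject_tokens: Sequence[int]) -> float:
    """Loss over cross-attention maps of shape (H, W, T), one map per prompt token."""
    # Strongest spatial activation for each subject token.
    peak = attn[..., list(subject_tokens)].max(axis=(0, 1))
    # The token with the weakest peak determines the loss.
    return float((1.0 - peak).max())


# Toy example: token 1 is well attended, token 2 is neglected.
attn = np.zeros((4, 4, 3))
attn[1, 1, 1] = 0.9
attn[2, 3, 2] = 0.2
loss = attend_and_excite_loss(attn, subject_tokens=[1, 2])
print(loss)
```

Taking the maximum over tokens, instead of the mean, focuses each update on the subject that is currently missing from the image.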

Figure 6. ControlNet [32] as condition-path optimization. A trainable control branch injects structural conditions into a frozen diffusion backbone during denoising.

Figure 7. DiffEdit [69] as mask-aware edit consistency. The predicted edit mask localizes where the instruction should act while preserving regions outside the target.

Figure 8. DreamBooth [34] and PhotoMaker [52] as two identity-consistency paradigms. DreamBooth stores identity through subject-specific adaptation, whereas PhotoMaker uses reusable identity features for inference-time conditioning.

Figure 9. AnimateDiff [36] and StoryDiffusion [47] for sequence-level consistency. AnimateDiff adds motion modules for local temporal coupling, whereas StoryDiffusion propagates identity and semantic state across longer sequences.

Figure 10. Diffusion-DPO [37] as preference-consistency optimization. Pairwise preference labels directly update the diffusion model, turning human judgments into a normative consistency objective.
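
The pairwise update in Figure 10 can also be sketched numerically. The following is a minimal sketch of the Diffusion-DPO objective as we understand it from [37]: the trained model is rewarded when it lowers the denoising error on the preferred sample, relative to a frozen reference model, by more than it does on the rejected sample. It assumes that the per-sample squared denoising errors are already computed and that timestep weighting is folded into `beta`; the function name and the default `beta` are illustrative.

```python
import numpy as np


def diffusion_dpo_loss(err_w, err_w_ref, err_l, err_l_ref, beta=1.0):
    """Each err_* is a per-sample squared denoising error ||eps - eps_hat||^2.

    err_w / err_l: trained model on the preferred / rejected sample.
    err_w_ref / err_l_ref: frozen reference model on the same noisy inputs.
    """
    err_w, err_w_ref, err_l, err_l_ref = (
        np.asarray(a, dtype=float) for a in (err_w, err_w_ref, err_l, err_l_ref)
    )
    # Negative margin: the model improves more on the winner than on the loser.
    margin = (err_w - err_w_ref) - (err_l - err_l_ref)
    logits = -beta * margin
    # -log(sigmoid(x)) computed stably as log(1 + exp(-x)).
    return float(np.mean(np.logaddexp(0.0, -logits)))


# If the trained model equals the reference, the loss is log(2).
neutral = diffusion_dpo_loss([1.0], [1.0], [1.0], [1.0])
# Lower error on the winner and higher error on the loser reduces the loss.
improved = diffusion_dpo_loss([0.5], [1.0], [1.5], [1.0])
print(neutral, improved)
```

Because the reference-model errors enter only as offsets, the objective penalizes drift from the reference model as well as disagreement with the preference labels.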

Table 1. Positioning of this survey relative to related surveys. Prior surveys usually follow task, modality, or alignment categories, whereas this survey uses consistency relations as the organizing principle and analyzes mechanisms, evaluation protocols, and trade-offs across tasks. In the relation columns, ✓ denotes primary coverage, Partial denotes secondary or indirect coverage, and – denotes little direct coverage. For Cross-rel., Limited denotes little analysis of interactions among consistency relations, Partial denotes occasional discussion, and ✓ denotes a main organizing goal. These labels are qualitative scope indicators, not quantitative scores.

Survey scope	Refs.	Focus	Ext.	Int.	Norm.	Cross-rel.
Text-to-image / Controllable generation	[16,17,24]	Prompt following, conditioning, and preference-related evaluation	✓	–	Partial	Limited
Editing / Personalization	[8,9,25,26]	Edit preservation, subject binding, and user-adaptive generation	✓	✓	Partial	Limited
Video / Long-form generation	[18,19,27]	Temporal coherence, video synthesis, and narrative continuity	Partial	✓	Partial	Partial
Alignment / Safety	[20,28,29]	Preference, safety, robustness, and trustworthy generation	Partial	–	✓	Partial
3D / 4D / Physical generation	[21,22,23,30]	Geometry, dynamics, physical plausibility, and embodied world modeling	Partial	✓	✓	Partial
This survey	–	Agreement relations, enforcement loci, evaluation protocols, and trade-offs	✓	✓	✓	✓

Table 2. Matrix view of representative external-consistency methods. Short labels identify the main agreement target, enforcement locus, and failure type; the surrounding text explains the mechanisms in detail.

Family	Target	Locus	Failure tag	Anchor methods
Attention repair	Prompt / comp.	Attention	Omission / grounding	Attend-and-Excite [31], SEGA [113]
Spatial guidance	Layout / count	Sampling	Relation / count	BoxDiff [58]
Control adapters	Structure	Condition path	Condition mismatch	ControlNet [32], T2I-Adapter [63]
Grounding modules	Region / reference	Grounding	Anchoring / ref. drift	GLIGEN [48], IP-Adapter [64]
Editing	Instruction / mask	Edit path	Over-editing / drift	DiffEdit [69], InstructPix2Pix [33], InstructDiffusion [70]

Table 3. Matrix view of representative internal-consistency methods. Short labels indicate the stabilized state, coupling route, and drift type; detailed mechanisms are discussed in the text.

Subproblem	State	Coupling route	Drift tag	Anchor methods
Subject identity	Subject ID	Adaptation / ID feature	Face / clothing	DreamBooth [34], PhotoMaker [52], InstantID [76]
Story identity	Character role	Cross-image propagation	Character drift	StoryDiffusion [47], ConsiStory [77]
Multi-view / 3D	Geometry	View coupling	View / geometry	Zero-1-to-3 [78], SyncDreamer [80], MVDream [35]
Text-to-video	Motion / appearance	Temporal module	Flicker / motion	AnimateDiff [36]
Video editing	Edited structure	Frame coupling	Shape wobble	StableV2V [129]
Narrative	Entity / event state	Planning / memory	Forgetting / contradiction	TaleCrafter [87], MovieDreamer [89]

Table 4. Matrix view of representative normative-consistency methods and resources. Short labels separate human-centered and world-centered targets while keeping mechanisms and failure modes readable at a glance.

Type	Target	Evidence	Risk / failure	Anchor methods / resources
Human-centered	Preference / aesthetics	Human reward	Reward / diversity	ImageReward [53], HPSv3 [166], Diffusion-DPO [37]
	Safety / values	Unsafe set + retention	Over-refusal / amnesia	SLD [38], UCE [97], Six-CD [101]
World-centered	Physical plausibility	Physics test	Physical violation	PhyBench [102], VideoPhy-2 [103], T2VPhysBench [170]
	Commonsense / causality	World-state test	Scene / causal break	T2VWorldBench [104], PhyWorldBench [105], VideoVerse [171]

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.