Preprint
Article

This version is not peer-reviewed.

From Alignment to Evocation: On the Capability Boundaries and Collaborative Paths of AI Art Creation

Submitted:

21 June 2026

Posted:

23 June 2026


Abstract
AI-generated works in music, literature, painting, and film have drawn widespread attention, yet assessments of AI's capabilities remain largely at the empirical level. This paper proposes an analytical framework based on neuroaesthetics that divides artistic creation into two levels: "Alignment" (statistical fitting) and "Evocation" (fragmentation and reorganization). Based on this distinction, it establishes the "Ring Scale" to quantify aesthetic intensity. Within this framework, the paper analyzes the neural foundations of the auditory, visual, and literary pathways and their integrative effects in multimodal comprehensive art (film and television), explains the deeper reasons why AI has achieved breakthroughs first in the auditory domain, and points out that the current core capability boundary of AI lies in the fact that "alignment has reached its extreme, while evocation still faces hierarchical obstacles." On this basis, the paper proposes three levels of strategic shifts (from replacing humans to complementing humans, from exhausting known imagery to probing individual fragments, and from pursuing ring scores to pursuing transformation) and establishes imagination as the core productive force in human-AI collaboration. The paper anchors the key to theoretical implementation in "Prompt Engineering," arguing that its essence is the process by which human creators translate imagination into AI-executable instructions, and proposes three core strategies: physiological arousal description, multimodal simulation prompts, and strategic blank-leaving. This paper also explores the potential positive value of AI's "hallucination" property at the evocation level. Finally, this paper systematically compiles the testable hypotheses proposed throughout the study, designs corresponding validation approaches, and calls for empirical research.
Subject: 
Arts and Humanities  -   Art

1. Introduction: A Question Demanding Clarification

Between 2025 and 2026, AI-generated content, especially in the domain of music, has stirred strong public response. Notable examples include AI-rendered versions of Jay Chou's "Hair Like Snow" in the style of the Tang dynasty's Anxi Frontier Command, and "Lonely Sandbank Cold" infused with revolutionary-era pathos. The performance, arrangement, and emotional expressiveness of these AI-generated works have produced a palpable sense of being "struck" in vast numbers of listeners. Meanwhile, AI literature and AI painting, despite their respective advances, have not matched the public acceptance and emotional impact of AI music. Since late 2025, AI filmmaking has begun to emerge, ranging from AI-generated short films to AI-assisted full narrative features. Some works have reached considerable levels of accomplishment in visual texture, pacing, and emotional delivery, marking a crucial step from single-modal to multimodal integration in AI creation.
This phenomenon raises a core question: What are the deep-rooted causes behind AI's differential capabilities across artistic domains? Is this a matter of uneven technological development, or does it stem from intrinsic differences among the art forms themselves? Furthermore, if our goal is not merely to "understand AI art creation" but to "advance AI art creation to a higher level," what theoretical framework should we establish, and where should we anchor its practical implementation?
This paper seeks to provide a systematic answer from the perspective of neuroaesthetics.

2. Core Framework: Alignment and Evocation

Artistic creation is inseparable from imagination. Whether in music, painting, literature, or film, the creator first constructs a scenario simulation that has never existed—"seeing" images that have never occurred, "hearing" melodies never before played, "feeling" the physical and mental experiences of fictional characters. This capacity to reassemble fragments into a whole is the prime mover of artistic creation. Once the work is completed and reaches the audience, imagination shifts from the creator to the receiver—the audience fills the gaps left by the work with their own fragments of experience, reconstructing that scenario simulation in their minds, even projecting themselves into it. The life of art is thus realized through this relay of dual imaginations.
Western aesthetic theories offer important insights into the cognitive mechanisms of imagination. Arthur Koestler, in The Act of Creation (1964), proposed the theory of "Bisociation," arguing that the essence of all creative activity (including humor, art, and scientific discovery) is the sudden juxtaposition of two concepts or entities that originally belong to different "matrices of thought." This theory is structurally isomorphic with the "category displacement" and "imagery collision" discussed later in this paper. Colin Martindale, in The Clockwork Muse (1990), proposed the "habituation-dishabituation" theory, contending that the intrinsic driving force of artistic creation is the pursuit of "novelty" and "arousal potential," compelling artists to continually break existing paradigms. AI hallucination, as a product of random deviation from training distributions, constitutes precisely such a natural dishabituation mechanism.

2.1. Alignment: Aesthetics as Statistical Fitting

"Alignment" refers to the successful match between the imagery and emotional paradigms output by an artwork and the audience's existing experience repositorire. The audience's response is: "This is accurate," "This is right," "This is how it is," "This sounds good." This is the foundational layer of aesthetic experience. Its neural basis lies in external stimuli successfully activating stored and labeled paradigm templates in the audience's brain. When the prediction error between stimulus and template falls within the optimal range, pleasure arises.
The technical essence of alignment is statistical fitting. Supported by big data, AI can reverse-engineer from vast repositories of human works "what kinds of imagery combinations and acoustic parameter combinations elicit what kinds of audience responses," and then perform optimal sampling from this mapping table. Alignment is the domain in which AI currently excels, particularly in music.

2.2. Evocation: Aesthetic Experience at the Fragment Recombination Level

"Evocation" refers to an artwork's imagery combinations successfully activating latent but unintegrated experiential fragments in the audience's brain, and fostering new connections among these fragments during activation. The audience's response is not "I recognize this," but "I've been struck," "It has spoken what I could never articulate." This is the higher-order layer of aesthetic experience—the Yijing (aesthetic resonance) experience.
The neural basis of "evocation" involves a key concept: the latent substratum of Yijing. In every person's brain exists a latent memory network constructed from a lifetime of multimodal embodied experience. These memory fragments (the coupling between a crescent moon glimpsed at dusk and the slight upward angle of the neck; the co-occurrence of a choked throat at parting and the icy touch of raindrops) normally reside in a sub-threshold state, never individually labeled, never linguistically named, never actively accessed by consciousness. They are scattered across various corners of multimodal memory, in a state of "existing but not yet illuminated," awaiting awakening.
When an artwork simultaneously illuminates these fragments—not by providing new information, but by establishing new connections—pathways between fragments are opened, and an unprecedented holistic experience emerges. This is Yijing. It depends on embodied experience, sub-threshold storage, and imagination—the power to heat these fragments to ignition point simultaneously.

2.3. The Relationship Between Alignment and Evocation

Alignment is the baseline—without it, the audience cannot enter the aesthetic channel.
Evocation, in contrast, exists on a spectrum of depth: preliminary evocation (B+ level) activates only a small number of fragments, producing a mild resonance of "feeling something"; large-scale evocation (A level) synchronously activates fragments across multiple brain regions, producing the intense experience of "being struck"; extreme evocation (A+ level), which approaches cognitive restructuring, triggers large-scale synchronization between the default mode network and the task-positive network, temporarily dissolving self-boundaries and producing sustained cognitive change after the experience.
AI art creation currently occupies the following position: alignment has reached its zenith; evocation can achieve preliminary levels (approximately B+) through statistical fitting, but faces insurmountable barriers at higher levels (A and above).

3. Neural Pathways: The Neurological Basis of Three Art Forms

3.1. The Auditory-Emotional Direct Pathway (Music)

Auditory signals pass through the cochlear nucleus and inferior colliculus to the auditory thalamus (the medial geniculate body), a portion of which projects directly to the amygdala. This is the shortest path: sound can trigger physiological emotional responses before the cerebral cortex has fully analyzed its meaning. The emotional impact of music is, in essence, a rapid hijacking of the amygdala.
This pathway is characterized by an extremely high degree of parameterizability. Tempo variations, harmonic tension and resolution, timbral brightness and darkness, volume envelopes—these are all variables that can be precisely quantified and exhaustively enumerated. AI possesses an inherent advantage in this pathway: it can search for optimal emotion-triggering combinations within vast acoustic parameter spaces, achieving precision unattainable by human creators.

3.2. The Visual-Semantic Construction Pathway (Painting)

Visual signals pass through the retina and lateral geniculate nucleus, primarily entering the occipital visual cortex, then proceeding along the "ventral pathway" for object recognition and meaning construction. This pathway is longer and slower than the auditory pathway, and heavily depends on the "world model" provided by the frontal and parietal lobes—three-dimensional spatial understanding, tactile expectations of materials, physical constraints of light and shadow. The visual system has been trained over the long course of evolution into an exquisitely precise "reality detector."
This pathway is characterized by being simultaneously constrained by two thresholds: "reality detection" and "Yijing evocation." AI has dual shortcomings: it lacks a 3D world model based on embodied experience, and it lacks embodied memory.

3.3. The Multimodal-Scenario Simulation Pathway (Literature)

Reading text is itself a visual act, but what literature evokes is not visual aesthetics alone—it is multimodal scenario simulation. Consider Han Yu's "Snow blocks the Blue Pass, the horse cannot advance" from his poem "Written on My Way into Exile at Languan Pass for My Grandnephew": these seven Chinese characters activate far more than the mere visual images of "snow" and "horse." They activate somatic sensations (heavy snow blocking the path, bone-piercing cold), spatial awareness (the treacherous Blue Pass, the vast road ahead), and emotion (the frustration and unwillingness of the exiled official), and also draw upon historical knowledge (Han Yu's offense against Emperor Xianzong for opposing the veneration of the Buddha's relic, resulting in his banishment to Chaozhou), as well as a deeper existential resonance—that sense of being blocked, of having a road yet being sealed off by snow, of having a horse yet being unable to advance an inch, a predicament that nearly every person has encountered in some form in their own life. It is precisely this multimodal, multi-layered synchronous activation that enables these seven characters to "strike" readers across a millennium.
A similar mechanism exists in Western Imagist poetry. Take Ezra Pound's "In a Station of the Metro" (1913):
"The apparition of these faces in the crowd; / Petals on a wet, black bough."
Pound uses just two lines to juxtapose the seemingly disparate imagery of "faces in a metro station crowd" and "petals on a wet, black bough," omitting all logical transition words to produce a powerful visual impact and aesthetic resonance. Compared with Han Yu's line, Pound's lines rely on the same mechanism of direct imagery juxtaposition to create aesthetic effect, the difference being that the former draws upon deeper layers of historical and cultural background knowledge, while the latter emphasizes the capture of a visual moment.
This pathway is characterized by the highest threshold. It demands that the artwork successfully activate fragments across multiple modalities simultaneously in the audience, and establish new connections among these fragments. This is precisely the domain where imagination plays its central role—the creator must first construct the multimodal scenario simulation in mind, then compress it into text; the reader must then use their own imagination to re-expand the text into a scenario simulation. The imagination of both parties is indispensable.

3.4. The Unique Position of Multimodal Integrated Art (Film)

Film is the integration field of the three aforementioned pathways. A complete AI film work must simultaneously handle screenplay (literature pathway), visuals (visual pathway), and score with sound effects (auditory pathway), while achieving precise synchronization of all three on the temporal axis. This makes film the ultimate testing ground for AI creative capability—it is not the simple sum of three capacities, but their product. The weakness of any single pathway is amplified by the other two.
The unique challenge of film lies in this: it requires not only alignment but cross-modal coherent evocation. When visuals, music, and dialogue converge on the same emotion at the same moment, they produce a multiplier effect. Conversely, if the visuals are evoking while the music is merely aligning, or the music is evoking while the dialogue is merely narrating, the inconsistency becomes a "fissure," and the audience feels "pulled out" of the experience.
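The "product, not sum" claim can be made concrete with a small piece of arithmetic. The sketch below is illustrative only: the pathway scores on a 0-to-1 scale, the two example films, and the function names are assumptions introduced here, not measurements or definitions from this paper.

```python
from math import prod

def additive_score(pathways):
    """Mean of the pathway scores: a weak pathway is averaged away."""
    return sum(pathways.values()) / len(pathways)

def multiplicative_score(pathways):
    """Product of the pathway scores: a weak pathway drags down the whole."""
    return prod(pathways.values())

# Hypothetical films scored per pathway on a 0-to-1 scale (assumed values).
balanced = {"literature": 0.8, "visual": 0.8, "auditory": 0.8}
lopsided = {"literature": 0.4, "visual": 1.0, "auditory": 1.0}

for name, film in (("balanced", balanced), ("lopsided", lopsided)):
    print(name,
          round(additive_score(film), 3),
          round(multiplicative_score(film), 3))
```

Under these assumed numbers the two films share the same additive score, yet the lopsided film scores lower under the multiplicative reading, because its weak screenplay cannot be offset by flawless visuals and sound.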
Take the opening sequence of Denis Villeneuve's Dune (2021) as an example: Hans Zimmer's low-frequency score (auditory pathway) and the vast desert visuals (visual pathway) synchronously evoke the audience's sense of awe and isolation at the same moment. Zimmer invented unconventional instruments specifically for Dune, creating "alien sounds" that belong to no known musical tradition—a classic case of a human creator using "technical means" to produce "defamiliarization effects." If AI were to score the same visuals, it would likely match "desert—vastness—orchestral music" from its training data as a statistical combination, but would be unable to "invent new sounds" by "inventing new instruments" as Zimmer did. This precisely confirms this paper's core argument: AI can exhaust existing paradigms at the "alignment" level, but at the "evocation" level, the "sounds that have never existed" created by human creators through embodied experience and imagination remain beyond AI's autonomous generation.
Since late 2025, several landmark AI film cases have emerged. Yet the most successful AI film works to date still rely on deep human involvement at every stage: story structure design, positioning of key emotional plot points, and control of the rhythm of multimodal "simultaneous impact." These stages require not single-point optimization but holistic imagination—a global perceptiveness that can foresee multimodal synergistic effects at the very inception of creation. Imagination in film creation is not an auxiliary tool; it is the core productive force.

4. The Ring Scale: A Quantification Dimension for Aesthetic Intensity

Based on the foregoing framework, this paper proposes the "Ring Scale" to quantify aesthetic impact and striking intensity.
| Ring | Experience Description | Neural Correspondence | Mechanism Attribute | Evocation Level |
|---|---|---|---|---|
| Miss | No response, even aversion | Prediction error too large or too small | Alignment failure | |
| 1-3 Rings | Recognition ("It's okay," "Listenable") | Familiar paradigm matched, no new connections | Pure alignment | |
| 4-6 Rings | Touched ("Good," "Interesting") | Local fragments activated, no cross-regional integration | Alignment dominant | B+: Preliminary evocation |
| 7-8 Rings | Struck ("Hit," "Goosebumps," "Tears welling") | Multimodal fragments synchronously activated, Yijing emerges | Evocation dominant | A: Large-scale evocation |
| 9 Rings | Transformation ("Brain reorganized") | Large-scale neural synchronization, self-boundary dissolution, sustained cognitive restructuring | Deep evocation | A+: Extreme evocation |
| 10 Rings | Ineffable | Rare peak experience, loss of time sense, dissolution of self | Limit of evocation | A+: Extreme evocation |
The theoretical significance of this scale lies in its capacity to distinguish "sounding good" from "being struck," and to further distinguish "being struck momentarily" from "being permanently changed." Below 7 rings is essentially the optimal combination of imagery, which is what AI has already achieved or is achieving. Rings 7-8 represent the emergence of Yijing; ring 9 and above represent cognitive restructuring. AI can already reliably output 7-8 ring works in music, but remains stalled around 4-6 rings in multimodal integration domains such as literature and film. Multimodal coherent evocation requires holistic imagination, which AI currently lacks but which can be supplemented through human-machine collaboration.
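For empirical work, the Ring Scale can be encoded as a simple lookup so that listener ratings map to a band in a consistent way. The following is a minimal sketch: the band labels come from the scale above, while the class name, the function name, and the use of 0 to stand for "Miss" are assumptions made for illustration.

```python
from typing import NamedTuple, Optional

class RingBand(NamedTuple):
    label: str
    experience: str
    mechanism: str
    evocation_level: Optional[str]  # None where the scale lists no level

# (lowest ring, highest ring, band). Encoding "Miss" as 0 is an assumption.
_BANDS = [
    (0, 0, RingBand("Miss", "No response, even aversion",
                    "Alignment failure", None)),
    (1, 3, RingBand("1-3 Rings", "Recognition", "Pure alignment", None)),
    (4, 6, RingBand("4-6 Rings", "Touched", "Alignment dominant",
                    "B+: Preliminary evocation")),
    (7, 8, RingBand("7-8 Rings", "Struck", "Evocation dominant",
                    "A: Large-scale evocation")),
    (9, 9, RingBand("9 Rings", "Transformation", "Deep evocation",
                    "A+: Extreme evocation")),
    (10, 10, RingBand("10 Rings", "Ineffable", "Limit of evocation",
                      "A+: Extreme evocation")),
]

def classify(rings: int) -> RingBand:
    """Return the Ring Scale band for an integer rating from 0 to 10."""
    for low, high, band in _BANDS:
        if low <= rings <= high:
            return band
    raise ValueError(f"ring rating must be between 0 and 10, got {rings!r}")

print(classify(7).evocation_level)
```

Keeping the bands in one table makes it easy to revise the cut-offs if later experiments suggest different boundaries.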

5. The Capability Boundaries of AI

The foundation of current AI creation is statistical learning. AI reverse-engineers from vast repositories of human works "what kinds of imagery combinations and acoustic parameter combinations elicit what kinds of audience responses," then performs sampling from this mapping table. Its essence is reverse engineering: deconstructing from Yijing to imagery, then reassembling from imagery.
But Yijing is not the sum of imagery; it is the new connection that emerges when multimodal fragments are simultaneously activated. What is reassembled from imagery possesses only imagery, not Yijing. This is because Yijing is already decomposed at the very moment of reverse engineering.
Taking music as an example: at the "alignment" level, AI can approach or even surpass human creators indefinitely; at the preliminary level of "evocation" (B+), AI can already achieve results through precise emotional parameter control; but at higher levels of evocation (A and above), it faces a chasm that is not purely technical—it lacks embodied experience and a latent evocative reservoir, and therefore cannot autonomously generate the power that simultaneously illuminates hidden fragments.
The weight of imagination becomes prominent here. If alignment depends on data and computing power, then evocation depends on imagination as the core productive force—the capacity to construct multimodal scenario simulations from nothing, to reassemble scattered fragments into a whole, to perceive possibilities in blank spaces. AI currently does not possess imagination in this sense. Its "generation" is the recombination of existing data, not the discovery of light that has not yet existed from within darkness. Therefore, in human-machine collaboration, the provider of imagination must be the human. The machine's role is that of an amplifier, not a light source.
It should be noted that the above judgment concerning AI's capability boundary—particularly the core claim that "AI lacks embodied experience and a latent evocative reservoir, and therefore cannot autonomously generate high-level evocation"—remains, at present, a hypothesis based on available evidence, not a verified conclusion. Its falsifiability lies in this: if someone were to construct an AI system with complete embodied experience (for example, through long-term accumulation of multimodal interaction data via physical robots), and the works of that system achieved evocation effects equivalent to those of human masters in strictly controlled experiments, then this hypothesis would be overturned. This is precisely the empirical direction this paper encourages.

6. From Description to Guidance: Three Strategic Shifts

6.1. Shift One: From Substituting Humans to Complementing Humans

AI's most effective positioning is as a "complementor." It excels at exhaustively enumerating optimal parameter combinations, but does not excel at injecting that which cannot be parameterized. The core division of labor in human-machine collaboration should be: the human is responsible for providing imagination—constructing a complete, multimodal scenario simulation in the mind; AI is responsible for exhaustively enumerating the optimal expressive scheme for this simulation within parameter space. For music, the human can set the emotional core and structural framework, while AI completes the arrangement and acoustic parameter optimization. For literature, the human can provide the multimodal details of the scenario simulation, while AI completes the language organization and imagery enumeration. For painting, the human can provide the Yijing direction and visual fragments, while AI completes the visual enumeration and stylistic adaptation. For film, the human is responsible for the holistic imagination of story structure and key plot points, while AI is responsible for shot enumeration, score optimization, and multimodal synchronization calibration.

6.2. Shift Two: From Exhausting Known Imagery to Detecting Individual Fragments

The current creative path of AI is "from imagery to imagery"—perpetually circling within the existing imagery pool. True evocation requires reaching the audience's latent evocative reservoir—those fragments that have never been labeled, that even the audience themselves do not know exist. This demands that AI move from "group statistics" to "individual detection." Concretely, a "fragment detection protocol" could be developed: allowing AI, through multiple rounds of interaction with a single user, to progressively learn that particular user's fragment distribution. A "Yijing feedback loop" could also be established: AI generates candidate works, the human provides feedback on whether they were "struck" or "not struck," AI adjusts parameters based on this feedback, and after sufficient iterative rounds, converges toward the user's unique fragment map.
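The "Yijing feedback loop" described above can be sketched as a small iterative procedure. This is a toy illustration under stated assumptions, not a proposed implementation: it assumes each candidate work can be annotated with a set of fragment tags, that feedback is a binary "struck" or "not struck," and that a simple moving-average update is adequate. The names FragmentMap, feedback_loop, and ask_user are hypothetical.

```python
class FragmentMap:
    """Running estimate of which fragment tags tend to 'strike' one user."""

    def __init__(self, learning_rate=0.5):
        self.learning_rate = learning_rate
        self.weights = {}  # tag -> estimated affinity in [-1, 1]

    def score(self, tags):
        # Unseen tags count as neutral (0.0).
        return sum(self.weights.get(tag, 0.0) for tag in tags)

    def update(self, tags, struck):
        # Move each tag's weight toward +1 (struck) or -1 (not struck).
        target = 1.0 if struck else -1.0
        for tag in tags:
            old = self.weights.get(tag, 0.0)
            self.weights[tag] = old + self.learning_rate * (target - old)

def feedback_loop(candidates, ask_user, rounds, fragment_map=None):
    """Show the most promising remaining candidate each round and learn.

    candidates: dict mapping a work's name to its set of fragment tags.
    ask_user:   callable taking a work's name, returning True if "struck".
    """
    fragment_map = fragment_map or FragmentMap()
    remaining = dict(candidates)
    for _ in range(rounds):
        if not remaining:
            break
        name = max(remaining, key=lambda n: fragment_map.score(remaining[n]))
        tags = remaining.pop(name)
        fragment_map.update(tags, ask_user(name))
    return fragment_map
```

In practice `ask_user` would collect a real listener's response; in a simulation it can be any function that returns True or False for a given work. A real system would also need some exploration of low-scoring candidates, which this greedy sketch omits.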

6.3. Shift Three: From Pursuing Ring Scores to Pursuing Transformation

The current evaluation system is "ring score-oriented"—using audience size, likes, and dissemination metrics as standards. This is alignment logic. But great works above 9 rings are not produced through "matching"; they are produced through "transformation"—they often strike only a few people initially, then spread outward from those few. Therefore, the evaluation standard for AI creation should shift from "how many people think it's good" to "how many people have been changed by it." AI's advantage lies precisely in its ability to simultaneously generate countless niche versions, tailor-made for every unique fragment combination.

7. Grounding in Practice: Imagination-Driven Prompt Engineering

The three strategic shifts outlined above must ultimately be grounded at a concrete operational level: prompt engineering. The essence of prompt engineering is the externalization and translation of imagination. The human creator constructs a complete multimodal scenario simulation in the mind, but this simulation cannot be directly transmitted to AI. The task of the prompt is to decompose this internal simulation into linguistic instructions that AI can understand and execute. This translation process inevitably incurs semantic loss; the goal of prompt engineering is to minimize this loss as much as possible, translating human imagination into machine-executable parameter space search paths as completely as possible.
It must be declared: the three strategies proposed below—physiological arousal description, multimodal simulation prompts, and strategic omission—are operational hypotheses derived from the foregoing neuroaesthetic framework. They have logical grounding, but have not yet been tested through rigorous controlled experiments. The aim of this section is not to claim that these strategies "have been proven effective," but to distill them as testable hypotheses, providing a starting point for subsequent empirical research.

7.1. Strategy One: From "Emotion Labels" to "Physiological Arousal Descriptions"

Emotion labels ("tragic heroism," "desolation") are high-level semantic concepts; AI processes them at the statistical level. Physiological arousal descriptions bypass semantic labels and directly access the physiological foundations of emotion. They do not tell AI "make this passage tragic and heroic"; they tell AI "make this melody cause the listener's chest to tighten, as if something is blocked there, wanting to shout but unable to make a sound, tear ducts activating." This strategy leverages the auditory-emotional direct pathway—activating the amygdala directly through acoustic parameters.
Operational essentials: Translate emotional objectives into concrete descriptions of physiological experience. Not "the climax should feel powerful," but "when the drum kicks in at the chorus, it should feel like a blunt punch to the back, making the body lurch forward involuntarily." The more specific the physiological details, the more precise AI's matching of acoustic parameters.

7.2. Strategy Two: From "Imagery Accumulation" to "Multimodal Simulation Prompts"

Imagery accumulation can only activate the visual labels in AI's database; it cannot activate multimodal simulation—without the participation of touch, taste, and visceral sensation, a complete scenario simulation cannot be constructed in the audience's mind. Multimodal simulation prompts require the creator to first "live" within that scene in their mind—to feel the direction, temperature, and scent of the wind, to feel the body's posture and the tension of the muscles—and then translate these sensations into concrete linguistic instructions. Not "write about an old soldier watching the sunset from the city wall," but "imagine an old soldier, his back against the battlement. The wind is dry, chapping his lips; he tastes blood and sand. What he holds in his hand is not a sword, but a pouch his wife handed him before he left for war, its original color no longer discernible. What the lyrics should describe is not the setting sun, not the city wall, but the texture of that pouch in his palm, and the weight of this silence pressing down upon him."
Operational essentials: Provide AI with embodied details that span multiple senses—touch, taste, visceral sensation, kinesthetic sensation. In film creation, shot prompts must simultaneously cover the visual texture of the image, the auditory texture of the sound effects, and the character's bodily sensations at that moment; all three are indispensable.

7.3. Strategy Three: Strategic Omission—From "Pursuing Perfection" to "Preserving Fissures"

AI's default tendency is to generate "perfect" output. But alignment-level perfection may precisely block the "fissures" that evocation requires. When everything is filled in, the audience has no space to inject their own experiential fragments, and Yijing cannot emerge.
This strategy directly aligns with the principle of Liubai (blank-leaving) in traditional Chinese aesthetics. Liubai is not absence; it is invitation. It is not deficiency; it is passage. In Chinese painting and calligraphy, the blank spaces are not "unpainted": the artist paints by not painting, allowing the viewer to paint for themselves. The "one-corner half-side" compositions of Ma Yuan and Xia Gui and the withered branches and solitary birds of Bada Shanren all use minimal objects to force the viewer's imagination into the blank spaces, making the blanks the generative field of Yijing. In Western aesthetic theory, Wolfgang Iser's concept of "gaps" (Leerstellen) in The Act of Reading (1978) converges on the same principle from a different path: the blanks and indeterminacies in the text summon the reader to participate in the production of meaning, and the process of the reader's imagination filling these gaps is precisely where aesthetic experience arises. In music, the rest is not the interruption of sound, but the space where sound continues to resonate within the audience. In literature, Hemingway's "iceberg theory" (showing only one-eighth, leaving seven-eighths underwater) is precisely the aesthetic practice of Liubai. In Western painting, Francis Bacon's Study after Velázquez's Portrait of Pope Innocent X renders the Pope's face in blurred, dragged brushstrokes, using "deliberate incompletion" to leave an immense psychological imaginative space for the viewer; this, too, is the practice of Liubai.
Operational essentials: Explicitly instruct AI in the prompt to create controlled imperfection or blank spaces—adding approximately 20% loss of control in the final chorus, the breath allowed to falter intermittently but never truly break; the closing electric guitar solo should not end too cleanly, allowing the final note to slowly dissipate in feedback; a moment of silence before a key line in film, an unexpected extension of an empty shot. The key parameter is the "degree" of Liubai—too little produces no effect; too much becomes genuine defect. This requires the human creator, relying on imagination and intuition, to judge where to fill and where to let go.
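The three strategies of this section can be combined into a single structured prompt. The sketch below assumes a plain-text prompt with one labeled block per strategy; the class name, field names, and section titles are illustrative choices rather than a format required by any particular AI system, and the example strings are taken from the operational essentials above.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class EvocationPrompt:
    task: str
    physiological_arousal: List[str] = field(default_factory=list)  # Strategy One
    multimodal_details: List[str] = field(default_factory=list)     # Strategy Two
    strategic_omissions: List[str] = field(default_factory=list)    # Strategy Three

    def render(self) -> str:
        """Assemble the prompt, skipping any strategy left empty."""
        sections = [
            ("Task", [self.task]),
            ("Bodily response to aim for", self.physiological_arousal),
            ("Embodied scene details", self.multimodal_details),
            ("Deliberate gaps and imperfections", self.strategic_omissions),
        ]
        lines = []
        for title, items in sections:
            if not items:
                continue
            lines.append(f"{title}:")
            lines.extend(f"- {item}" for item in items)
        return "\n".join(lines)

prompt = EvocationPrompt(
    task="Write the final chorus of a song about an old soldier.",
    physiological_arousal=[
        "When the drum kicks in, it should feel like a blunt punch to the back.",
    ],
    multimodal_details=[
        "The wind is dry, chapping his lips; he tastes blood and sand.",
        "Describe the texture of the pouch in his palm, not the setting sun.",
    ],
    strategic_omissions=[
        "Let the breath falter intermittently but never truly break.",
        "Let the final note slowly dissipate in feedback.",
    ],
)
print(prompt.render())
```

Separating the three strategies into fields makes it possible to vary one while holding the others fixed, which is the kind of controlled comparison the hypotheses in this section would need.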

8. AI Hallucination: A Potential Catalyst for Imagination

In the preceding discussion, AI "hallucination"—the model's deviation from facts during generation, fabricating non-existent information—has been tacitly assumed to be a flaw requiring suppression. But in the domain of artistic creation, this attribute may warrant reassessment.
The essence of AI hallucination is a deviation from the training distribution occurring during statistical fitting. The imagery combinations it generates lie outside the standard mapping table of humanity's existing experience repository; though "inaccurate," this very deviation may serve as the source of novel imagery connections. Human artists often employ similar mechanisms in creation: the automatic writing of Surrealism, Dalí's paranoiac-critical method, and the unexpected modulations in jazz improvisation all actively manufacture "controlled hallucinations" to derail consciousness from habitual associative paths, thereby tapping into uncharted neural pathways. Artists have paid a considerable price to enter this state of "temporary withdrawal of the logical editing layer." According to a survey of 20 industries by the U.S. Substance Abuse and Mental Health Services Administration (SAMHSA), the arts and entertainment industry ranks second in drug use and fourth in heavy drinking. In the history of Western modern art, Abstract Expressionist painters Jackson Pollock and Mark Rothko both struggled with long-term alcoholism; their "drip painting" technique and color field painting, respectively, carried an automatist quality of "disinhibition" in the creative process itself, yet both ultimately paid the price of their health, and even their lives, for it.
If we regard AI hallucination as a random, paradigm-unconstrained imagery generation mechanism, it may offer human creators the following values:
First, as a trigger source for imagination. Human imagination, however powerful, remains constrained by the boundaries of personal experience and habitual associative paths. The unconventional imagery combinations produced by AI hallucination may precisely strike fragment connections that the creator themselves has never accessed, thereby triggering the creator's secondary imagination—not AI completing the act of imagination, but AI's unexpected output becoming a springboard for human imagination.
Second, as an accelerator of "evocation." Alignment-level output is "expected"; it brings comfort but not shock. AI hallucination produces output that is "unexpected"; it may, after a brief moment of bewilderment, forge an entirely new connection in the audience—an experience structurally similar to the neural process of "brain reorganization" in high-level evocation (A level): in both cases, existing connections are broken and new connections established.
Third, as an enhancement mechanism for blank-leaving. The imagery combinations produced by AI hallucination often carry a hazy, uncertain, even slightly illogical quality. This quality can precisely manufacture strategic blank-leaving for the work—not fissures deliberately designed by the human creator, but unexpected spaces "spilling over" from AI itself. When facing these unexpected spaces, the degree of imaginative engagement from the audience may be higher than when facing carefully designed blank-leaving.
A strict qualification is necessary: this positive value of AI hallucination can only be realized under the supervision and selection of the human creator. Unscreened AI hallucination is merely error. Only when the human creator uses their own imagination to identify, select, and secondarily process those "promising errors" can AI hallucination potentially transform from flaw into catalyst. It does not replace human imagination; rather, it provides an additional perturbation source for human imagination—a perturbation source that may knock imagination off its habitual track and into new territory.

9. Conclusion

This paper has established the dual framework of "Alignment and Evocation" and the quantification dimension of the "Ring Scale" from the perspective of neuroaesthetics, systematically analyzing the deep-rooted causes behind AI's divergent capabilities across the auditory, visual, and literary pathways, as well as in multimodal film integration. The core judgment of this paper is: AI has reached its zenith at the "alignment" level; at the "evocation" level, it can achieve preliminary levels (B+) through statistical fitting, but faces fundamental barriers to achieving large-scale evocation (A) and extreme evocation (A+). The root of this barrier lies in AI's lack of embodied experience and a latent evocative reservoir, rendering it incapable of autonomously generating the power to simultaneously illuminate hidden fragments.
The practical corollary of this judgment is the repositioning of the human-machine collaborative relationship. In human-machine collaborative creation, imagination is the core productive force: the human creator is responsible for providing this productive force; AI is responsible for exhaustively enumerating optimal expressive schemes within parameter space; prompt engineering is the key link that translates imagination from the human brain to the machine. The three prompt engineering strategies—physiological arousal description, multimodal simulation prompts, and strategic blank-leaving—provide operational paths for this translation process from the three dimensions of emotion, perception, and structure, respectively. At the same time, AI hallucination, an attribute long regarded as a flaw, possesses potential value as a trigger source for imagination, an accelerator of evocation, and an enhancement mechanism for blank-leaving in the specific context of artistic creation. This value does not alter the fundamental judgment that "AI lacks autonomous imagination," but it opens a new dimension for human-machine collaboration: not a one-way injection of imagination from human to AI, but an imaginative interaction between human and AI—the human providing core imagination, AI providing random perturbation, the two jointly approaching a greater possibility through iterative feedback.
The final stance of this paper is neither praise nor depreciation of AI's creative capabilities, but the delineation of a new creative relationship: in human-machine collaborative creation, imagination has never been as important as it is today. It is not an obsolete capacity superseded by technology, but the core capacity for mastering technology. Those fragments hidden deepest in the human heart may be closer than ever before to the moment of being precisely, violently, and irreversibly "detonated."
A companion paper, The Serendipity of Imagery: On the Isomorphism between AI Hallucination and Aesthetic Mechanisms in Chinese Lyric Writing (in Chinese, forthcoming), further explores the practical application of the "hallucination as resource" framework in the specific domain of Chinese lyric poetry composition, offering an empirical case study for the theoretical framework established herein.

10. Hypotheses, Verification, and Call for Empirical Research

The core contribution of this paper lies in proposing a testable theoretical framework, rather than claiming to have discovered ultimate truth. The process of science is not from correctness to correctness, but from hypothesis to verification; the value of a good hypothesis lies not in whether it is immediately confirmed, but in whether it is sufficiently clear, falsifiable, and worthy of being tested.

10.1. List of Testable Hypotheses

This paper proposes the following testable hypotheses:
  • Hypothesis One: Physiological arousal descriptions outperform emotion labels. In prompts for AI music/sound effect creation, using physiological arousal descriptions enhances the audience's emotional experience intensity and physiological arousal level more effectively than using emotion labels.
  • Hypothesis Two: Multimodal simulation prompts outperform purely visual imagery. In prompts for AI literature/film script creation, incorporating multi-sensory embodied details enhances the audience's multimodal scenario simulation intensity and Yijing experience depth more effectively than purely visual imagery accumulation.
  • Hypothesis Three: Strategic blank-leaving has an optimal range. In AI art creation, instructing AI to manufacture controlled imperfections or blanks produces superior evocation effects compared to "flawless" output; however, the degree of blank-leaving has an optimal range—too little produces no effect, too much becomes genuine defect.
  • Hypothesis Four: Screened output from AI hallucination can serve as a trigger source for imagination. The unconventional imagery combinations produced by AI hallucination, after human creator screening and secondary processing, can trigger the human creator's imagination and ultimately elevate the evocation level of the work.
  • Hypothesis Five: Evocation has levels, and higher levels require the participation of a latent evocative reservoir. Aesthetic evocation is a spectrum from shallow to deep (B+ → A → A+); higher-level evocation (A level and above) requires the participation of the audience's latent evocative reservoir, and therefore depends more heavily on individual differences among audience members.
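
To make the contrasts in the hypotheses above concrete, the following is a minimal sketch of how matched prompt conditions for Hypothesis One might be represented and how participants might be assigned to them. It is an illustration under stated assumptions, not a procedure specified by this paper: the names `PromptCondition`, `CONDITIONS`, and `assign_balanced`, the creative theme, and the prompt wording are all hypothetical. Only the prompt strategy differs between the two conditions; the theme is held constant.

```python
import random
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class PromptCondition:
    hypothesis: str  # which hypothesis the condition belongs to
    strategy: str    # name of the prompt strategy under test
    prompt: str      # full text that would be sent to the model


# Hypothetical theme and wording, invented for illustration only.
THEME = "a short piano piece about leaving home"

CONDITIONS = [
    # Control: a bare emotion label.
    PromptCondition("H1", "emotion_label", f"Compose {THEME}. Mood: sad."),
    # Treatment: a physiological arousal description of the intended effect.
    PromptCondition(
        "H1",
        "physiological_arousal",
        f"Compose {THEME}. The listener's breathing should slow, "
        "the chest tighten, and the throat catch.",
    ),
]


def assign_balanced(
    participants: List[str], conditions: List[PromptCondition], seed: int
) -> Dict[str, PromptCondition]:
    """Randomly assign participants so group sizes differ by at most one."""
    rng = random.Random(seed)       # fixed seed keeps the assignment reproducible
    shuffled = participants[:]      # copy, so the caller's list is not mutated
    rng.shuffle(shuffled)
    # Deal the shuffled participants round-robin across the conditions.
    return {p: conditions[i % len(conditions)] for i, p in enumerate(shuffled)}


if __name__ == "__main__":
    people = [f"P{i:02d}" for i in range(10)]
    for person, cond in sorted(assign_balanced(people, CONDITIONS, seed=1).items()):
        print(person, cond.strategy)
```

The same structure would extend to the other hypotheses by adding further condition pairs; the actual prompt texts would need to be piloted and fixed before data collection.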

10.2. Verification Framework

All of the above hypotheses can be tested through randomized controlled experiments. The core approach is: under conditions controlling for variables such as AI model, creative theme, and work duration, systematically compare the differences in audience experience produced by works generated from different prompt strategies. Dependent variables should simultaneously encompass subjective ratings (degree of "being struck," Yijing depth) and objective physiological indicators (skin conductance, heart rate variability), in order to capture the dual psychophysiological effects of evocation. The optimal range hypothesis for strategic blank-leaving requires setting up multiple gradient comparison groups (e.g., 0%, 10%, 20%, 30%, 40% blank-leaving degree) to test whether evocation effects follow an inverted U-shaped curve. The catalytic effect of AI hallucination can be tested by comparing the final work quality of "creators receiving hallucination output" versus "creators receiving conventional output." The association between latent evocative reservoir and evocation level requires introducing individual difference measurements, testing whether higher-level evocation depends more on the audience's embodied experience richness and mental imagery vividness.
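
As one possible way to examine the inverted U-shaped prediction for blank-leaving, the sketch below fits a quadratic curve to mean evocation ratings across the gradient groups and reports the estimated peak only when the curve is concave and peaks inside the tested range. It is a descriptive illustration under stated assumptions, not the paper's analysis plan: the function names `fit_quadratic` and `inverted_u_peak` and the ratings in the usage example are hypothetical, and a real study would add inferential statistics (for example, confidence intervals on the quadratic term).

```python
from typing import Optional, Sequence, Tuple


def fit_quadratic(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float, float]:
    """Least-squares fit of y = a + b*x + c*x**2; returns (a, b, c)."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need at least three paired observations")
    # Power sums that make up the normal equations.
    s = [sum(xi ** k for xi in x) for k in range(5)]
    t = [sum(yi * xi ** k for xi, yi in zip(x, y)) for k in range(3)]
    # Augmented 3x4 matrix of the normal equations.
    m = [
        [s[0], s[1], s[2], t[0]],
        [s[1], s[2], s[3], t[1]],
        [s[2], s[3], s[4], t[2]],
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("degenerate design: need three distinct x values")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(3):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [vr - f * vc for vr, vc in zip(m[r], m[col])]
    a, b, c = (m[i][3] / m[i][i] for i in range(3))
    return a, b, c


def inverted_u_peak(x: Sequence[float], y: Sequence[float]) -> Optional[float]:
    """Return the fitted peak location if the curve is an inverted U
    peaking within the tested range; otherwise return None."""
    _, b, c = fit_quadratic(x, y)
    if c >= 0:                      # not concave, so no inverted U
        return None
    peak = -b / (2 * c)             # vertex of the fitted parabola
    return peak if min(x) <= peak <= max(x) else None


if __name__ == "__main__":
    blank_levels = [0, 10, 20, 30, 40]       # percent blank-leaving per group
    mean_ratings = [1.0, 4.0, 5.0, 4.0, 1.0]  # invented values, illustration only
    print(inverted_u_peak(blank_levels, mean_ratings))
```

A returned value of `None` would indicate that the fitted curve is not concave or that its peak falls outside the tested gradient, in which case the optimal-range hypothesis would not be supported by that data set.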

10.3. Call for Empirical Research

The above hypotheses and their verification frameworks constitute a broad roadmap from this paper's theoretical framework toward empirical research. These experiments are technically feasible: they require no revolutionary new technologies or equipment, only standardized experimental design, adequately sized volunteer samples, and API access to AI models.
This paper earnestly calls upon researchers in the fields of cognitive neuroscience, empirical aesthetics, artificial intelligence, and human-computer interaction to undertake verification of the above hypotheses. If these hypotheses are confirmed, they will provide the first set of empirically grounded prompt design guidelines for AI art creation, advancing the strategic shift from "alignment" to "evocation" from theory to practice. If certain hypotheses are falsified, that, too, constitutes valuable progress—falsification means we have eliminated certain seemingly plausible paths, and the clarity of the theory will be enhanced accordingly.
Whether confirmed or falsified, the act of testing itself is the greatest affirmation of this paper. For the ultimate purpose of a hypothesis-driven paper is not to be believed, but to be tested.

Author Contributions

Conceptualization, X.Y.; methodology, X.Y.; formal analysis, X.Y.; investigation, X.Y.; writing—original draft preparation, X.Y.; writing—review and editing, X.Y. The author has read and agreed to the published version of the manuscript. Hongsheng Li contributed professional knowledge in computer science and participated in content review.

Funding

This research received no external funding.

Data Availability Statement

No new data were created or analyzed in this study.

Acknowledgments

The author thanks the anonymous reviewers for their valuable comments. During the preparation of this manuscript, the author used DeepSeek (large language model) for grammar and style refinement. The author has reviewed and edited the output and takes full responsibility for the content of this publication.

Conflicts of Interest

The author declares no conflicts of interest.

References

  1. Barsalou, Lawrence W. Perceptual symbol systems. Behav. Brain Sci. 1999, 22(4), 577–660.
  2. Friston, Karl. The free-energy principle: A unified brain theory? Nat. Rev. Neurosci. 2010, 11(2), 127–138.
  3. Gallese, Vittorio; Lakoff, George. The brain’s concepts: The role of the sensory-motor system in conceptual knowledge. Cogn. Neuropsychol. 2005, 22(3–4), 455–479.
  4. Holm-Hadulla, Rainer M.; Bertolino, Aljoscha. Creativity, alcohol and drug abuse: The pop icon Jim Morrison. Psychopathology 2014, 47(3), 167–173.
  5. Iser, Wolfgang. The Act of Reading: A Theory of Aesthetic Response; Johns Hopkins University Press: Baltimore, 1978.
  6. Koestler, Arthur. The Act of Creation; Hutchinson: London, 1964.
  7. Ludwig, Arnold M. Alcohol input and creative output. Br. J. Addict. 1990, 85(7), 953–963.
  8. Martindale, Colin. The Clockwork Muse: The Predictability of Artistic Change; Basic Books: New York, 1992.
  9. Norlander, Torsten; Gustafson, Roland. Effects of alcohol on a divergent figural fluency test during the illumination phase of the creative process. Creat. Res. J. 1998, 11(3), 265–274.
  10. Ramachandran, V. S.; Hirstein, William. The science of art: A neurological theory of aesthetic experience. J. Conscious. Stud. 1999, 6(6–7), 15–51.
  11. Sayette, Michael A.; et al. Alcohol and creativity: A meta-analysis. Psychol. Bull. 2012, 138(4), 637–663.
  12. Substance Abuse and Mental Health Services Administration (SAMHSA). Substance Use and Substance Use Disorders by Industry. SAMHSA: Washington, DC, 2007. Available online: https://store.samhsa.gov/product/substance-use-and-substance-use-disorders-by-industry/sma07-4294.
  13. Yi, Xianqun. The Serendipity of Imagery: On the Isomorphism between AI Hallucination and Aesthetic Mechanisms in Chinese Lyric Writing. Forthcoming, 2026 (in Chinese).
  14. Zeki, Semir. Inner Vision: An Exploration of Art and the Brain; Oxford University Press: Oxford, 1999.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.

Preprints.org is a free preprint server supported by MDPI in Basel, Switzerland.

© 2026 MDPI (Basel, Switzerland) unless otherwise stated