3. Methodology
3.1. Research Framework and Experiment Design
This study employed a multimodal experimental design grounded in the EBD framework to investigate the influence of AI-generated biophilic façade variations on visual attention and emotional responses in automated retail contexts. As illustrated in Figure 1, the research framework followed a three-stage EBD process comprising (1) environmental analysis, (2) reasoning and inference, and (3) generation, evaluation, and feedback, enabling a structured integration of design logic, generative modeling, and human-centered assessment.
In EBD Stage 1, the environmental analysis was conducted to identify key challenges associated with automated retail environments, including limited social interaction, potential sensory deprivation, and elevated psychological stress. To address these issues from a health-oriented perspective, relevant WELL Building Standard criteria, particularly those related to mental well-being and biophilic design, were reviewed and synthesized. This stage established the environmental and theoretical basis for subsequent visual dimension mapping, ensuring that design variables were grounded in both contextual needs and established health principles. Recent findings have supported this focus on environmental context, showing that the specific context in which information is processed significantly modulates the emotional responses captured via facial expression analysis [42].
EBD Stage 2 focused on reasoning and inference using a Recursive Object Model (ROM) to translate environmental requirements into design-operational variables. Based on the literature review, WELL standard mapping, and case study analysis, three core visual dimensions (color, material, and pattern) were identified as primary façade design variables. These dimensions were embedded into a directional prompt vocabulary composed of core design elements, descriptive modifiers, and WELL-integrative labels. This structured prompt system guided AI-based façade generation using Stable Diffusion XL with ControlNet, allowing targeted local regeneration while maintaining geometric consistency across images. To ensure methodological rigor, a prior validation process combining perceptual similarity control (LPIPS) and expert-guided evaluation was applied, resulting in the selection of eight façade stimuli representing distinct combinations of the extracted visual dimensions.
In EBD Stage 3, the generated façade stimuli were evaluated through a multimodal assessment framework integrating objective and subjective measures. Visual attention was captured using eye-tracking metrics, including revisit count, dwell time (DT), fixation count (FC), and fixation duration (FD). Emotional responses were assessed through facial behavior indicators derived from facial expression analysis, while subjective appraisal was collected using a semantic differential (SD) scale to capture participants’ conscious aesthetic evaluations. The results of these evaluations informed an exploratory refinement phase, forming an EBD feedback loop that supports iterative reflection on the relationship between design variables, perceptual responses, and evaluation outcomes.
3.2. Participants
Participants were recruited through digital announcements distributed across university online communities and professional social networks. Initially, 48 volunteers provided informed electronic consent according to the ethical protocols approved by the Institutional Review Board (Approval No. 1040395-202510-01). The inclusion criteria required participants to be between 18 and 50 years of age, with normal or corrected-to-normal visual acuity of at least 0.8, to ensure high-quality eye-tracking data collection.
A power analysis conducted using G*Power 3.1.9.7 determined that a minimum sample of 27 was necessary to achieve a statistical power of 0.85 based on an effect size of 0.20 and a significance level of 0.05 for eight repeated measures. To account for potential data loss and technical challenges associated with remote online testing, the initial recruitment was expanded to 48 individuals. Following data quality screening within the iMotions platform, which excluded records with incomplete gaze trajectories or insufficient facial illumination, a final cohort of 30 participants (15 males and 15 females) was retained for the analysis. Age was recorded in ranges; most of the sample (86.7%) fell in the 18- to 25-year-old category, a tech-savvy demographic that frequently uses automated retail environments.
3.3. Experimental Stimuli
The stimuli were developed by integrating EBD methodology with generative artificial intelligence, as established in our preliminary research [16,27]. Building upon these established AI generative strategies, the simulation process utilized the Stable Diffusion XL (SDXL) model in conjunction with the ControlNet Depth model to ensure precise structural fidelity to the original architectural form. This procedure began with a high-resolution photograph of an existing automated retail storefront, which served as the control condition designated as I1. Seven experimental variations were then synthesized by systematically modulating three biophilic design parameters, i.e., materiality, pattern density, and color organicism, derived from the WELL Building Standard.
The ControlNet Depth model was specifically employed to extract a depth map of the storefront, allowing complex organic textures to be applied while maintaining the original building proportions and spatial layout. The generative procedure involved applying specific prompt frameworks to modify the façade attributes, resulting in a gradient of biophilic complexity. These variations included isolated modifications, namely color only (I2), material only (I3), and pattern only (I4), alongside integrated configurations: color and material (I5), color and pattern (I6), material and pattern (I7), and a comprehensive synthesis of all three elements (I8).
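The eight conditions correspond to every subset of the three modified dimensions. The following minimal sketch, which is an illustration rather than the authors' generation pipeline, enumerates these subsets; the names `DIMENSIONS` and `build_conditions` are hypothetical, and the ordering simply happens to reproduce the I1 to I8 labels given in the text.

```python
from itertools import combinations

# Dimension order taken from the text (color, material, pattern).
DIMENSIONS = ("color", "material", "pattern")


def build_conditions(dimensions=DIMENSIONS):
    """Enumerate every subset of the modified dimensions, smallest first."""
    conditions = {}
    index = 1
    for size in range(len(dimensions) + 1):
        for subset in combinations(dimensions, size):
            conditions[f"I{index}"] = subset
            index += 1
    return conditions


conditions = build_conditions()
for stimulus_id, modified in conditions.items():
    # An empty subset is the unmodified photograph used as the control.
    label = " + ".join(modified) if modified else "control (original photograph)"
    print(stimulus_id, label)
```

Enumerating the subsets in this way makes it easy to check that the design is a full 2 × 2 × 2 factorial of the three dimensions, with I1 as the unmodified baseline.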
All eight images were standardized to a resolution of 1125 × 844 pixels to ensure visual consistency. Each stimulus was presented for a fixed duration of 30 seconds, separated by a 3-second black screen interval to prevent visual carryover effects. These stimuli, including the design ID, stimulus type, and corresponding images, are presented in Figure 2.
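The presentation timing can be sketched as a simple schedule. This is an illustrative sketch only: the function `build_schedule` and its `seed` parameter are hypothetical, the randomized order follows Section 3.4.2, and it is assumed here that the black screen appears only between consecutive stimuli (seven intervals), which the text does not state explicitly.

```python
import random

STIMULUS_IDS = [f"I{i}" for i in range(1, 9)]
STIMULUS_SECONDS = 30  # fixed display duration per image
BLANK_SECONDS = 3      # black screen between images (assumed: between stimuli only)


def build_schedule(seed=None):
    """Return a list of (start_s, end_s, label) events for one participant."""
    rng = random.Random(seed)
    order = STIMULUS_IDS[:]
    rng.shuffle(order)  # randomized order per participant

    schedule = []
    clock = 0
    for position, stimulus_id in enumerate(order):
        schedule.append((clock, clock + STIMULUS_SECONDS, stimulus_id))
        clock += STIMULUS_SECONDS
        if position < len(order) - 1:
            schedule.append((clock, clock + BLANK_SECONDS, "blank"))
            clock += BLANK_SECONDS
    return schedule


schedule = build_schedule(seed=1)
total_seconds = schedule[-1][1]
print(f"viewing phase lasts {total_seconds} s")
```

Under these assumptions the viewing phase takes 8 × 30 s plus 7 × 3 s, i.e., 261 s, which is a small part of the overall session that also includes calibration and the questionnaire.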
3.4. Experiment Setup and Procedure
3.4.1. Experiment Environment and Setup
The investigation was conducted remotely using webcam-based recording, in strict accordance with the approved IRB protocol. Participants completed the study individually in a naturalistic online environment using their own desktop computers equipped with high-definition webcams. As illustrated in Figure 3, volunteers maintained a stable and upright posture, facing the monitor directly at a viewing distance of approximately 50-60 cm [43]. The webcam, positioned above the display, facilitated the simultaneous recording of gaze trajectories and facial behavioral markers through the iMotions Online platform. A minimum camera resolution of 720p was required to ensure the precision of gaze tracking and facial marker extraction.
Prior to the formal stimuli presentation, participants received comprehensive online instructions regarding posture adjustment, screen alignment, and lighting optimization to prevent backlighting interference. A rigorous 13-point calibration procedure was implemented to map individual ocular characteristics to the screen coordinates. The entire experimental procedure was non-invasive and required no additional wearable sensors or peripheral devices, complying with the ethical standards for human subject research approved by the IRB. After the image viewing task, a questionnaire was administered to capture conscious perceptions.
3.4.2. Experiment Procedure
The experimental protocol followed a structured five-stage sequence with a total duration of approximately 25 to 30 minutes per participant.
Preparation and Informed Consent: Participants initially accessed the study via a secure web link. They were presented with a digital briefing regarding the research objectives and data privacy measures. In accordance with the approved IRB protocol, participants provided their informed electronic consent before proceeding to the technical setup.
System Calibration: After adjusting their seating position and lighting, participants completed the 13-point calibration process to ensure eye-tracking precision. Only participants who met the minimum accuracy threshold were permitted to continue to the stimulus phase.
Stimuli Presentation: The eight biophilic façade variations were presented in a randomized order to eliminate sequence bias. Each image was displayed for a fixed duration of 30 seconds, interspersed with a 3-second black screen interval to reset visual fixation. During this period, iMotions synchronously captured subconscious physiological markers.
Subjective Evaluation (SD Survey): Immediately following the final stimulus, participants completed a Semantic Differential scale questionnaire. They evaluated each façade design based on 10 bipolar adjective pairs to capture their conscious aesthetic and psychological assessments.
Reward and Exit: Upon successful completion of the survey, participants were directed to the exit screen. As an incentive for their participation, a mobile gift voucher (gifticon) valued at 10,000 KRW was sent to the contact information provided within 24 hours of the session.
Figure 4. Overview of the experimental procedure.
3.5. Data Quality Control and Pre-Processing
To ensure the precision of the multimodal biometric analysis, a rigorous data screening protocol was applied to the raw signals captured from the 48 initial participants. Each trial was evaluated against three specific quality thresholds established for remote eye-tracking and facial behavioral analysis. First, the gaze tracking accuracy was required to be within 1.3 visual degrees to maintain the spatial validity of the Areas of Interest (AOI). Second, a minimum data completion rate of 90% was mandated for both gaze trajectories and facial expression streams to ensure the temporal continuity of the behavioral responses. Finally, the synchronization of multimodal data was verified through the iMotions WebET system to confirm that physiological reactions were accurately mapped to the corresponding design stimuli.
The study implemented a 13-point calibration procedure before the formal trials to optimize the webcam-based acquisition. Following the application of these technical criteria, 18 participants were excluded due to signal instability or calibration failure, resulting in a final analysis sample of 30 participants. This screening ensured that the subsequent inferential statistics were conducted on high-fidelity data, mitigating the noise inherent in remote experimental settings.
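The screening thresholds described in this section can be expressed as a simple filter. The sketch below is an assumption-laden illustration, not the iMotions implementation: the `Recording` fields, the `passes_screening` function, and the example values are all hypothetical, and the sketch works on one summary record per participant because the text does not specify how trial-level checks were aggregated.

```python
from dataclasses import dataclass

MAX_ACCURACY_DEG = 1.3  # gaze accuracy threshold in visual degrees
MIN_COMPLETION = 0.90   # minimum data completion rate for each stream


@dataclass
class Recording:
    participant_id: str
    accuracy_deg: float     # calibration accuracy (lower is better)
    gaze_completion: float  # share of valid gaze samples, 0 to 1
    face_completion: float  # share of valid facial-expression samples, 0 to 1
    synchronized: bool      # gaze and face streams aligned to the stimuli


def passes_screening(rec):
    """Apply the three quality criteria to one participant summary."""
    return (
        rec.accuracy_deg <= MAX_ACCURACY_DEG
        and rec.gaze_completion >= MIN_COMPLETION
        and rec.face_completion >= MIN_COMPLETION
        and rec.synchronized
    )


# Hypothetical records for illustration only.
recordings = [
    Recording("P01", 0.9, 0.97, 0.95, True),
    Recording("P02", 1.6, 0.98, 0.96, True),   # fails accuracy
    Recording("P03", 1.1, 0.85, 0.93, True),   # fails gaze completion
    Recording("P04", 1.3, 0.90, 0.90, True),   # exactly at the thresholds
]
retained = [r.participant_id for r in recordings if passes_screening(r)]
print(retained)
```

Whether values exactly at a threshold are retained is a design choice; this sketch treats the thresholds as inclusive.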
3.6. Eye-Tracking Metrics and AOIs Definition
To facilitate a structured evaluation of visual attention across the biophilic variations, the storefront façades were partitioned into three functionally and geometrically distinct Areas of Interest (AOIs). These three AOIs allowed for a detailed analysis of how AI-generated features modulate visual engagement across different architectural components:
Glass Façade AOI (AOI 1): This zone encompassed the glass surfaces located on both sides of the central entrance. Due to the symmetrical layout of the façade, this AOI consisted of two distinct segments (identified as 1a and 1b during data collection), which were consolidated for the final analysis to evaluate the holistic effect of biophilic patterns applied to the transparent interface.
Signage Sides AOI (AOI 2): This area comprised the façade sections flanking the main signage. It measured the secondary visual attraction elicited by biophilic interventions around the store's identity zone. Like AOI 1, this region was recorded as two lateral segments (identified as 2a and 2b) and subsequently merged to represent the combined visual load on the signage periphery.
Signage Area AOI (AOI 3): This zone focused exclusively on the main signage area. Analyzing this AOI revealed the extent to which biophilic designs on the surrounding façade either complement or compete with the brand identity for the viewer's attention.
Within these AOIs, four primary eye-tracking metrics were extracted to quantify visual engagement:
Dwell Time (ms): The total duration of all fixations and saccades within a specific AOI, representing the overall attentional investment.
Fixation Count: The total number of fixations within an AOI, indicating the level of visual processing and information extraction.
Revisit Count: The number of times a participant's gaze returned to an AOI after looking away, serving as an indicator of sustained visual interest or cognitive re-engagement.
Fixation Duration (ms): The average length of individual fixations, reflecting the depth of cognitive processing and the visual complexity of the stimulus.
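The four metrics can be derived from a time-ordered fixation list, as the following minimal sketch illustrates. It is not the iMotions computation: the input format, the function `aoi_metrics`, and the segment label "3" are assumptions, while the segment labels 1a, 1b, 2a, and 2b follow the text. Dwell time is approximated as the span from the first to the last fixation of each uninterrupted visit, so that saccades between fixations inside the same merged AOI are included.

```python
from collections import defaultdict

# Lateral segments are merged into a single AOI before analysis.
AOI_MERGE = {"1a": "AOI 1", "1b": "AOI 1", "2a": "AOI 2", "2b": "AOI 2", "3": "AOI 3"}


def aoi_metrics(fixations):
    """fixations: time-ordered (start_ms, duration_ms, segment) tuples.

    segment is None when the fixation falls outside every AOI.
    """
    stats = defaultdict(lambda: {"dwell": 0, "count": 0, "visits": 0, "fix_ms": 0})
    current = None
    visit_start = visit_end = 0

    for start, duration, segment in fixations:
        aoi = AOI_MERGE.get(segment)
        if aoi != current:
            if current is not None:
                stats[current]["dwell"] += visit_end - visit_start  # close visit
            current = aoi
            if aoi is not None:
                stats[aoi]["visits"] += 1
                visit_start = start
        if aoi is not None:
            visit_end = start + duration
            stats[aoi]["count"] += 1
            stats[aoi]["fix_ms"] += duration
    if current is not None:
        stats[current]["dwell"] += visit_end - visit_start

    return {
        aoi: {
            "dwell_time_ms": s["dwell"],
            "fixation_count": s["count"],
            "revisit_count": s["visits"] - 1,  # returns after the first visit
            "mean_fixation_duration_ms": s["fix_ms"] / s["count"],
        }
        for aoi, s in stats.items()
    }


# Hypothetical fixation sequence for one participant and one stimulus.
example = [
    (0, 200, "1a"),
    (250, 300, "1b"),   # same merged AOI, so the visit continues
    (600, 250, "3"),
    (900, 150, None),   # outside all AOIs
    (1100, 400, "1a"),  # return to AOI 1 counts as one revisit
]
metrics = aoi_metrics(example)
print(metrics["AOI 1"])
```

In this example, AOI 1 accumulates a dwell time of 950 ms over two visits, three fixations with a mean duration of 300 ms, and one revisit.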
Table 1. Definitions and settings of the Area of Interest (AOI).
| AOI Name | Physical Coverage | Stimulus with coloring AOIs |
| --- | --- | --- |
| Façade_AOI | Building façade, featuring texture, material, and patterns | (image) |
3.7. Facial Expression Markers and Behavioral Indicators
Subconscious behavioral markers, specifically head orientation (Yaw) and the activation of Lip Suck, Nose Wrinkle, and Lid Tightening, were extracted to evaluate pre-conscious cognitive processing. The selection of these biometric markers aligns with contemporary multi-modal frameworks that utilize synchronized physiological data to quantify implicit human-technology interactions [40]. Within this framework, each specific indicator was selected for its capacity to reveal distinct layers of environmental perception:
Yaw (Head Orientation): This metric served as a direct indicator of postural engagement and visual curiosity. In the context of architectural interfaces, subtle adjustments in head orientation reflect an involuntary intent to explore spatial details and complex organic textures, signaling an instinctive attraction to the stimuli.
Lip Suck: In facial behavioral analysis, the activation of Lip Suck is increasingly recognized as a marker of cognitive effort and pre-conscious evaluative processing. It indicates that the viewer is deeply engaged in decoding the visual complexity of the biophilic patterns, reflecting an automatic physiological response to environmental information.
Lid Tightening and Nose Wrinkle: These markers are associated with perceptual scrutiny and concentrated focus. They facilitate the identification of a transition from passive viewing to active sensory decoding, particularly when participants encounter highly intricate AI-generated designs that demand greater cognitive activation.
The synchronization of these implicit markers allows for a high-resolution mapping of the user experience beyond conscious verbal reporting.
3.8. Subjective Evaluation: SD Scale and Validation
Immediately following the biometric recording, participants completed a Semantic Differential (SD) questionnaire to provide conscious aesthetic evaluations. The scale comprised ten adjective pairs reflecting the dimensions of the WELL Mind concept and architectural perception. The reliability and validity of the instrument were then tested. The analysis yielded a Cronbach’s alpha of 0.960 and a KMO value of 0.944, with a highly significant Bartlett’s test result (p < 0.001), confirming that the subjective data provided a stable baseline for comparison. The adjective pairs are presented in Table 2.
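For reference, Cronbach's alpha for k items is k/(k − 1) × (1 − Σ item variances / variance of the summed score). The sketch below implements this standard formula with the Python standard library; it is not the SPSS procedure used in the study, and the function name and the example ratings are hypothetical.

```python
from statistics import variance


def cronbach_alpha(responses):
    """responses: list of rows, one per rating, each holding k item scores."""
    k = len(responses[0])
    # Sample variance of each item (column) across all ratings.
    item_variances = [variance(column) for column in zip(*responses)]
    # Sample variance of the summed scale score.
    total_variance = variance([sum(row) for row in responses])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Hypothetical ratings on three adjective pairs (illustration only).
ratings = [
    [5, 6, 5],
    [3, 3, 4],
    [6, 7, 6],
    [2, 3, 2],
    [4, 4, 5],
]
alpha = cronbach_alpha(ratings)
print(round(alpha, 3))
```

Values close to 1 indicate that the items vary together, which is the sense in which the reported coefficient of 0.960 indicates high internal consistency.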
3.9. Data Processing and Statistical Analysis
Statistical analyses were performed using SPSS version 27.0 and Python. The four primary eye-tracking metrics were Revisit Count, Fixation Duration, Dwell Time, and Fixation Count. Gaze data were rescaled to the interval [0, 1] using min-max normalization to facilitate a descriptive analysis of visual attention distribution. These values were then mapped to a 0 to 100 scale for radar chart visualization to enhance legibility and allow for a consistent comparison of design features across the metrics.
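Min-max normalization maps each value x to (x − min) / (max − min). A minimal sketch follows; the function names and dwell-time values are hypothetical, and it is assumed that normalization is applied per metric across the stimuli being compared, which the text does not specify.

```python
def min_max(values):
    """Rescale values to the interval [0, 1]."""
    low, high = min(values), max(values)
    if high == low:
        # All values identical: no spread to rescale, so return zeros.
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


def to_radar_scale(values):
    """Map values to a 0 to 100 scale for radar chart display."""
    return [100.0 * v for v in min_max(values)]


# Hypothetical mean dwell times (ms) for three stimuli.
dwell_times = [1200, 1800, 1500]
print(min_max(dwell_times))
print(to_radar_scale(dwell_times))
```

Because the rescaling is relative to the observed minimum and maximum, the normalized values describe the ordering and spacing of the stimuli on a metric rather than its absolute magnitude.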
Physiological and behavioral responses were assessed using four specific markers: Yaw, Lip Suck, Nose Wrinkle, and Lid Tightening. These indicators were recorded synchronously via the iMotions platform to capture subconscious cognitive and emotional arousal levels. For the subjective evaluation, the internal reliability of the ten adjective pairs on the Semantic Differential scale was verified using Cronbach’s alpha coefficients. Descriptive statistics, including mean values and standard deviations, were calculated for each psychological dimension to establish the conscious perception level of the control configuration.
To assess the efficacy of the biophilic façade variations, a univariate general linear model with a repeated measures design was applied to the facial behavioral markers and the subjective data. Greenhouse-Geisser corrections were applied to adjust the degrees of freedom when the sphericity assumption was violated, and post-hoc comparisons with Bonferroni adjustments were conducted to identify significant differences between specific stimulus pairs. In addition, a repeated measures correlation analysis, implemented in Python, was conducted to quantify the relationship between implicit physiological markers and explicit psychological evaluations. This approach was chosen because it controls for between-subject differences inherent in repeated exposure designs, thereby isolating within-subject associations across stimulus conditions. The significance threshold was set at p < 0.05, and the magnitude of the correlation coefficients was interpreted qualitatively based on the effect size conventions established by Cohen [44].
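The idea behind repeated measures correlation can be sketched by removing each participant's own mean from both variables and correlating the residuals, which yields the same coefficient as the common-slope ANCOVA formulation. The code below is a minimal illustration with hypothetical names and data; it is not the authors' script, whose implementation is not described in the text.

```python
from collections import defaultdict
from math import sqrt


def repeated_measures_correlation(rows):
    """rows: iterable of (participant_id, x, y) observations.

    Returns (r, degrees_of_freedom) for the within-subject association.
    """
    by_participant = defaultdict(list)
    for participant_id, x, y in rows:
        by_participant[participant_id].append((x, y))

    centered_x, centered_y = [], []
    for pairs in by_participant.values():
        mean_x = sum(p[0] for p in pairs) / len(pairs)
        mean_y = sum(p[1] for p in pairs) / len(pairs)
        for x, y in pairs:
            # Subtracting participant means removes between-subject differences.
            centered_x.append(x - mean_x)
            centered_y.append(y - mean_y)

    sxy = sum(a * b for a, b in zip(centered_x, centered_y))
    sxx = sum(a * a for a in centered_x)
    syy = sum(b * b for b in centered_y)
    r = sxy / sqrt(sxx * syy)
    dof = len(centered_x) - len(by_participant) - 1
    return r, dof


# Hypothetical data: two participants with very different baselines
# but the same within-subject trend.
data = [
    ("P01", 1, 10), ("P01", 2, 11), ("P01", 3, 12),
    ("P02", 11, 1), ("P02", 12, 2), ("P02", 13, 3),
]
r, dof = repeated_measures_correlation(data)
print(r, dof)
```

In this example a pooled Pearson correlation would be negative because of the baseline offset between the two participants, whereas the within-subject coefficient is 1.0 with 3 degrees of freedom, which illustrates why the method suits designs in which every participant views all stimuli.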