III. Validation of the New Approach
While extensive validation through comprehensive studies is typically necessary for a novel approach, such validation is beyond the scope of this initial study. Instead, we focus on the foundational criteria required for an effective method in detecting redundancy in psychological constructs. Specifically, the approach should demonstrate three key capabilities: (1) converging items with high correlations, (2) discriminating between items with low correlations, and (3) identifying semantic overlap across different scales.
To assess these capabilities, we conducted three experiments. The first experiment tested the approach’s ability to converge highly correlated items and discriminate between weakly correlated items. The second experiment evaluated its capacity to identify semantic overlap across different scales. Finally, the third experiment served as a robustness check and offered a series of comparative analyses of the overall competence of the approach.
Experiment 1 evaluates ESAA’s ability to converge highly correlated items and discriminate between weakly correlated items. If ESAA has reliable convergence ability, the EGF should group items measuring the same underlying construct together, with intra-cluster SPIC at least as high as the corresponding intra-scale SPIC. Likewise, if ESAA has sufficient discriminant ability, the EGF should allocate items from scales measuring conceptually distinct constructs to different clusters, with inter-cluster SPIC much lower than the corresponding intra-cluster SPIC and no higher than the corresponding inter-scale SPIC.
Methods
Materials
To test ESAA’s convergence and discrimination abilities, the experimental materials must meet certain criteria: (1) each scale should be widely recognized for high internal consistency, (2) the concepts should exhibit low correlation, as supported by prior studies, and (3) both scales should have an equal number of items.
In line with these criteria, the Conscientiousness and Gratitude scales were chosen:
Conscientiousness Scale: Derived from the NovoPsych Five Factor Personality Scale-30 (Buchanan & Hegarty, 2023), an instrument based on the well-established five-factor model of personality (a.k.a. OCEAN), this 6-item scale measures Conscientiousness. A sample item is “I often do just enough work to get by” (reverse scored).
Gratitude Scale: The Gratitude Questionnaire-Six Item Form (GQ-6; McCullough et al., 2002) includes 6 items, with a sample item being “I have so much in life to be thankful for”.
Both scales are well-validated with high internal consistency and the two concepts exhibit minimal to no correlation (Ajmal et al., 2016; Kong et al., 2020), making them ideal for examining convergence and discrimination in psychological research.
Hypotheses
Given the nature of these scales, the items measuring Conscientiousness and Gratitude should map onto two separate regions of the embedding space, with the distance between these regions significantly greater than the within-region distances.
Hypothesis 1. The macro-structure of the EGF will match the SOF; and in the EGF, the inter-cluster SPIC will be significantly lower than the intra-cluster SPIC, with a large effect size.
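The comparison in Hypothesis 1 can be illustrated with a minimal sketch. It assumes that SPIC can be approximated by the cosine similarity between item embeddings (the formal definition is given in Section II and is not reproduced here), and it uses randomly generated toy vectors rather than the study's embeddings. The function names `cosine_matrix` and `split_pairs` are hypothetical.

```python
import numpy as np

def cosine_matrix(embeddings):
    """Cosine similarity between every pair of row vectors."""
    x = np.asarray(embeddings, dtype=float)
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    return x @ x.T

def split_pairs(similarity, labels):
    """Return (intra, inter): similarities of same-label and different-label pairs."""
    labels = np.asarray(labels)
    i, j = np.triu_indices(len(labels), k=1)  # each unordered pair exactly once
    same = labels[i] == labels[j]
    values = similarity[i, j]
    return values[same], values[~same]

# Toy stand-in for item embeddings (NOT the study's data):
# six vectors scattered around each of two orthogonal centers.
rng = np.random.default_rng(0)
centers = 4.0 * np.eye(16)[:2]
toy = np.vstack([centers[k] + 0.3 * rng.normal(size=(6, 16)) for k in (0, 1)])
labels = ["C"] * 6 + ["G"] * 6

intra, inter = split_pairs(cosine_matrix(toy), labels)
print(len(intra), len(inter))        # 30 within-group and 36 between-group pairs
print(intra.mean() > inter.mean())   # within-group cohesion exceeds cross-group similarity
```

With two 6-item scales there are 2 × 15 = 30 within-group pairs and 6 × 6 = 36 between-group pairs, which is why the two samples compared in the significance test have unequal sizes.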
Results and Discussion
The hierarchical clustering analysis conducted for these 12 items reveals two distinct clusters, each corresponding precisely to one of the original scales. Embeddings from the same scale are closely grouped together, while embeddings from different scales are clearly separated, as shown in Figure 2a,b.
Figure 2a displays the dendrogram of the EGF, which shows the hierarchical clustering of the item embeddings from the Conscientiousness and Gratitude scales using Ward’s method. The y-axis indicates the semantic distance between clusters, while the x-axis lists the individual item embeddings being clustered. The clusters are color-coded for clarity, and item indices are labeled “C” for Conscientiousness scale items and “G” for Gratitude scale items.
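A minimal sketch of Ward clustering on item embeddings follows. It is an illustration under stated assumptions, not the pipeline used in the study: the toy vectors are random, the embeddings are L2-normalized so that Euclidean distance (which Ward's method requires) is monotone in cosine similarity, and the tree is cut into two clusters with SciPy's `fcluster`.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Toy stand-in for item embeddings (NOT the study's data).
rng = np.random.default_rng(0)
centers = 4.0 * np.eye(16)[:2]
toy = np.vstack([centers[k] + 0.3 * rng.normal(size=(6, 16)) for k in (0, 1)])
names = [f"C{i}" for i in range(1, 7)] + [f"G{i}" for i in range(1, 7)]

# Assumption: unit-length vectors, so Euclidean distance tracks cosine similarity.
unit = toy / np.linalg.norm(toy, axis=1, keepdims=True)

# Each row of `merges` records one merge: the two clusters joined,
# the merge height, and the size of the new cluster.
merges = linkage(unit, method="ward")

# Cut the tree so that at most two clusters remain.
cluster_ids = fcluster(merges, t=2, criterion="maxclust")
assignment = dict(zip(names, cluster_ids.tolist()))
print(assignment)
```

The merge heights in the third column of `merges` are what a dendrogram plots on its y-axis; `scipy.cluster.hierarchy.dendrogram` can draw the tree from the same matrix.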
Figure 2b presents a Kernel Density Estimate (KDE) plot of the dimension-reduced embeddings. To generate Figure 2b, we followed a three-step process: (1) computing semantic distances, (2) reducing the embedding dimensions to two using PCA, and (3) applying a KDE to the PCA-reduced coordinates, with items color-coded according to the EGF, i.e., the clustering results above. This plot clearly highlights the separation and cohesion of the clusters in the EGF.
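Steps (2) and (3) of this process can be sketched as follows. This is a hedged illustration on random toy vectors: PCA is implemented through a singular value decomposition, the density is estimated per cluster with `scipy.stats.gaussian_kde` using its default bandwidth, and the grid size and padding are arbitrary choices, none of which are taken from the study.

```python
import numpy as np
from scipy.stats import gaussian_kde

def pca_2d(embeddings):
    """Project row vectors onto their first two principal components."""
    x = np.asarray(embeddings, dtype=float)
    centered = x - x.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

# Toy stand-in for item embeddings (NOT the study's data).
rng = np.random.default_rng(0)
centers = 4.0 * np.eye(16)[:2]
toy = np.vstack([centers[k] + 0.3 * rng.normal(size=(6, 16)) for k in (0, 1)])
labels = np.array(["C"] * 6 + ["G"] * 6)

coords = pca_2d(toy)  # shape (12, 2)

# Evaluation grid covering the projected points with some padding.
pad = 1.0
xs = np.linspace(coords[:, 0].min() - pad, coords[:, 0].max() + pad, 60)
ys = np.linspace(coords[:, 1].min() - pad, coords[:, 1].max() + pad, 60)
gx, gy = np.meshgrid(xs, ys)
grid = np.vstack([gx.ravel(), gy.ravel()])

density, peak_x = {}, {}
for lab in ("C", "G"):
    kde = gaussian_kde(coords[labels == lab].T)  # expects shape (n_dims, n_points)
    values = kde(grid)
    density[lab] = values.reshape(gx.shape)       # ready for a contour plot
    peak_x[lab] = grid[0, values.argmax()]        # PC1 position of the density peak

print(peak_x)
```

Each entry of `density` can be passed to a contour-plotting routine to obtain one colored region per cluster; in this toy example the two density peaks fall on opposite sides of the first principal component.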
As described in Section II of this study, intra-cluster SPIC values represent the semantic coherence within each cluster, while inter-cluster SPIC values reflect the divergence between different clusters. The mean SPIC for intra-cluster pairs within the Gratitude and Conscientiousness scales (M = 0.426, SD = 0.0039) was higher than that for inter-cluster pairs (M = 0.193, SD = 0.0022), indicating that item pairs within the same cluster are closer to each other than pairs spanning different clusters. To test whether this difference between intra-cluster and inter-cluster SPIC values is statistically significant, we used Welch’s t-test, which accommodates unequal variances and sample sizes. The test revealed a significant difference, t(43.35) = 18.46, p < .001. The effect size, measured by Cohen’s d, was 4.50, indicating a very large effect. These findings show that intra-cluster similarities are significantly higher than inter-cluster similarities.
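The test described above can be sketched with SciPy. The similarity values below are simulated, not the study's data, and the pooled-standard-deviation form of Cohen's d in the hypothetical helper `cohens_d` is an assumption, since the text does not state which variant was used.

```python
import numpy as np
from scipy import stats

def cohens_d(a, b):
    """Cohen's d using the pooled standard deviation of two samples."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    pooled_var = (
        (len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1)
    ) / (len(a) + len(b) - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

# Simulated similarity values (NOT the study's data):
# 30 intra-cluster pairs and 36 inter-cluster pairs, as for two 6-item scales.
rng = np.random.default_rng(1)
intra = rng.normal(0.43, 0.06, size=30)
inter = rng.normal(0.19, 0.05, size=36)

# equal_var=False selects Welch's t-test instead of Student's t-test.
result = stats.ttest_ind(intra, inter, equal_var=False)
d = cohens_d(intra, inter)
print(f"t = {result.statistic:.2f}, p = {result.pvalue:.3g}, d = {d:.2f}")
```

Welch's test is the natural choice here because the two groups of pairs differ in both size and spread, so the equal-variance assumption of Student's test is not warranted.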
These results support H1, demonstrating that ESAA effectively distinguishes between concepts and converges items within the same concept. We can therefore reject the null hypothesis that ESAA lacks convergence or discriminant ability.
However, it is important to note that these conclusions are limited to the specific materials used in this experiment. Similar to a Turing test, Experiment 1 was designed to explore the capabilities of a new technique. While the results from this single experiment are promising, they establish only a necessary condition for demonstrating ESAA’s effectiveness. Extensive empirical validation through future research is required to fully confirm the approach’s generalizability and robustness.
Experiment 2 assesses the effectiveness of ESAA in detecting semantic proximity between scale items that measure psychological concepts with potential redundancy. These concepts may be so similar that they lack incremental validity, meaning that they do not contribute unique information beyond what is already captured by other concepts. As ESAA is designed to help researchers identify redundancy in psychological concepts, demonstrating this overlap-detection capability is crucial for validating its utility in future studies.
Methods
Materials
To assess the ESAA’s capacity for detecting redundancy, it is important to select psychological concepts that have been explicitly recognized in the literature as containing overlapping elements. This ensures a clear benchmark for evaluating the reliability of ESAA’s results.
The concepts of Grit and Conscientiousness meet this criterion: many studies have questioned whether Grit is redundant with Conscientiousness. A meta-analysis of Grit research (Credé, Tynan, & Harms, 2017), which analyzed 584 effect sizes across 88 independent samples (totaling 66,807 individuals), found that Grit, initially proposed as a higher-order trait predictive of success and performance, correlates very strongly with Conscientiousness (ρ = .84), calling its distinctiveness as a construct into question.
This experiment utilized the following scales:
Conscientiousness Scale: The same scale as in Experiment 1.
Grit Scale: Derived from Duckworth et al. (2007), this scale contains 12 items divided into two facets, Perseverance of Effort and Consistency of Interest, with 6 items per facet. A sample item is “I have achieved a goal that took years of work”.
Given these materials, the SOF in Experiment 2 can be visualized as shown in Figure 3, with Grit and Conscientiousness treated as distinct categories and Grit’s two facets as sub-categories.
Hypotheses
If ESAA is capable of identifying redundancy, the EGF should not mirror the SOF and should instead show signs of semantic blending between Conscientiousness and Grit items. This is because the EGF is designed to yield an optimized semantic structure, whereas the SOF has been questioned for lacking discrimination. According to an extensive literature, the SOF does not satisfy discriminant validity. As Credé, Tynan, and Harms (2017) noted, the correlation between the perseverance facet of Grit and Conscientiousness (ρ = .89) far exceeds the typical correlation between two different global measures of Conscientiousness (ρ = .63, according to Pace & Brannick, 2010). The meta-analysis therefore suggested that “grit is not a higher-order construct characterized by two lower-order facets” and “may be redundant with conscientiousness”. This leads to the hypothesis for the current experiment:
Hypothesis 2. The structure of the EGF will not be identical to the SOF, with items from Grit and Conscientiousness interfused rather than separated.
Results and Discussion
Applying ESAA to the items from the Grit and Conscientiousness scales resulted in two major clusters, as revealed by the hierarchical structure depicted in Figure 4a,b. The full merging process is visualized in the dendrogram in Figure 4a, which illustrates the hierarchical clustering of item embeddings from the two scales based on Ward’s method. The y-axis represents the dissimilarity between clusters, while the x-axis shows individual items, color-coded for clarity. Items from the Conscientiousness scale are labeled “C”, while those from the Grit scale are labeled “G”. “Cluster 1” in Figure 4 notably includes a mix of items from the Grit scale along with all of the Conscientiousness items. This suggests that, in terms of semantic proximity, some Grit items are closer to Conscientiousness, leading the algorithm to group them together. Specifically, the Grit items involved in this interfusion are exactly those measuring the Perseverance of Effort facet, which is fully consistent with the conclusion of the meta-analysis by Credé, Tynan, and Harms (2017).
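The mixed-source cluster described above can be made explicit by cross-tabulating cluster membership against scale of origin. The sketch below encodes the composition reported in the text (Cluster 1 holds all Conscientiousness items plus the Perseverance of Effort items; the remaining Grit items form the other cluster). The item identifiers (`C1`, `E1`, `I1`, ...) and the cluster numbers are illustrative assumptions, as is the helper name `composition`.

```python
from collections import Counter

# Scale of origin (the SOF side). E = Perseverance of Effort items,
# I = Consistency of Interest items; both facets belong to the Grit scale.
origin = {f"C{i}": "Conscientiousness" for i in range(1, 7)}
origin.update({f"E{i}": "Grit" for i in range(1, 7)})
origin.update({f"I{i}": "Grit" for i in range(1, 7)})

# Cluster membership as described in the text; the numbering is illustrative.
cluster = {item: 1 if item[0] in "CE" else 2 for item in origin}

def composition(cluster, origin):
    """Count, for each cluster, how many items come from each source scale."""
    table = {}
    for item, cid in cluster.items():
        table.setdefault(cid, Counter())[origin[item]] += 1
    return table

table = composition(cluster, origin)
mixed = [cid for cid, counts in table.items() if len(counts) > 1]

for cid in sorted(table):
    print(cid, dict(table[cid]))
print("mixed-source clusters:", mixed)
```

A framework that mirrored the SOF would yield no mixed-source clusters; any cluster drawing items from more than one scale is a candidate site of semantic overlap.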
Figure 4b provides further evidence of semantic overlap, displaying a KDE plot based on the dimension-reduced embeddings. This figure highlights the interweaving of Grit and Conscientiousness items, confirming the semantic fusion between them. Figure 4b was generated with the same three-step process as in Experiment 1: semantic distance computation, dimensionality reduction, and KDE visualization. These results support Hypothesis 2, which predicted interfusion between items from the Grit and Conscientiousness scales, and they are highly consistent with the literature suggesting that the Grit concept is redundant.
Moreover, the EGF performs well in terms of convergent and discriminant validity, as the detailed statistics in Table 1 show. The intra-cluster SPIC of the EGF (M = 0.372, SD = 0.132, n = 81) was significantly higher than the inter-cluster SPIC (M = 0.294, SD = 0.071, n = 72). Welch’s t-test was used instead of Student’s t-test because of the significant heterogeneity of variance in the data, t(126) = 4.572, p < .001. The effect size (Cohen’s d = -0.717, medium to large by Cohen’s conventional benchmarks) indicates that pairwise semantic similarity differs substantially between within-cluster and between-cluster item pairs in the EGF. The difference between intra-scale and inter-scale SPICs for the SOF also reached significance, but all of its performance statistics are inferior to those of the EGF.
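As a plausibility check, the reported statistics can be approximately recomputed from the summary values alone. The sketch below assumes that the two clusters contain 12 and 6 items (all Conscientiousness items plus one Grit facet, and the other Grit facet), that Cohen's d was computed with a pooled standard deviation, and that the rounded means and SDs are sufficient inputs. The function names are hypothetical.

```python
from math import comb, sqrt

def welch_from_summary(m1, s1, n1, m2, s2, n2):
    """Welch's t and Welch-Satterthwaite df from means, SDs, and sample sizes."""
    v1, v2 = s1**2 / n1, s2**2 / n2
    t = (m1 - m2) / sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    return t, df

def pooled_d(m1, s1, n1, m2, s2, n2):
    """Cohen's d with a pooled standard deviation (an assumed variant)."""
    pooled_var = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
    return (m1 - m2) / sqrt(pooled_var)

# Pair counts implied by a 12-item and a 6-item cluster (18 items in total).
n_intra = comb(12, 2) + comb(6, 2)  # 66 + 15 = 81 within-cluster pairs
n_inter = 12 * 6                    # 72 between-cluster pairs

# Rounded summary statistics as reported for the EGF.
intra = (0.372, 0.132, n_intra)
inter = (0.294, 0.071, n_inter)

t, df = welch_from_summary(*intra, *inter)
d = pooled_d(*intra, *inter)
print(f"t = {t:.2f}, df = {df:.0f}, d = {d:.2f}")
```

Run on these rounded inputs, the sketch gives t of about 4.6 on roughly 126 degrees of freedom and d of about 0.72, close to the reported values; small discrepancies are expected from rounding. The sign of d depends only on which group is entered first.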
In summary, the results of Experiment 2 demonstrate that ESAA successfully identified semantic interfusion between the Grit and Conscientiousness scales, as evidenced by the hierarchical structure and the dimension-reduced KDE plots. Hypothesis 2, which predicted this interfusion, was supported by the formation of the mixed-source cluster, highlighting the overlapping semantic nature of the two concepts. Furthermore, the SPIC statistics demonstrate the validity of this EGF. Overall, the results of Experiment 2 indicate that ESAA has the potential to serve as a reliable tool for detecting redundancy among psychological constructs, addressing the study’s objective of validating ESAA’s overlap-detection competence.
The results of the previous two experiments suggest that ESAA is a promising tool for research on redundancy among psychological concepts. However, before drawing a conclusion about its validity, two questions remain. (1) Are the results of the first two experiments robust? Experiments 1 and 2 each used their own, independently selected materials. If the materials were input in a different way, would ESAA’s calculations remain stable? This is the “robustness” question. (2) Does ESAA provide added value compared with baseline tools? A simple and intuitive baseline is an LLM-based chatbot, such as ChatGPT. If similarly effective analytical results can be generated through simple prompts, there may be no substantial benefit in developing a new tool. It is therefore crucial to compare the performance of ESAA with that of existing chatbots, which will help determine whether ESAA offers genuine innovations and improvements. This is the “incremental value” question.
To address these questions, this experiment comprised a series of trials in which ESAA and alternative approaches were applied to the same input material and the resulting frameworks (the EGF and the alternatives) were compared. We hypothesized that ESAA is robust and offers incremental value: its outputs should be consistent with one another, and the EGF should outperform the other frameworks in convergence, discrimination, and redundancy detection.
Material and Procedure
The materials for this experiment are a combination of those used in the previous two experiments, incorporating three scales measuring Conscientiousness, Grit, and Gratitude (24 items in total). The procedure involved generating the frameworks, followed by three series of trials comparing the EGF with the frameworks generated by alternative approaches.
First, the robustness of ESAA was evaluated by comparing the EGF from Experiment 3 with those from Experiments 1 and 2. We expected the EGF of the current experiment to align with the previous ones, confirming the stability of ESAA’s output. Second, to assess ESAA’s incremental value, we generated two baseline frameworks using the most advanced LLM-based chatbots available at the time of the study, ChatGPT 4o and o1-preview, anticipating that the chatbot-generated frameworks would be inferior to the EGF in terms of convergence, discrimination, and interpretability.
Results and Discussion
The clustering results in Figure 5a,b clearly reveal the hierarchical structure of the EGF for Experiment 3. Figure 5a presents the dendrogram of the ESAA analysis, which illustrates the hierarchical clustering of item embeddings from the Conscientiousness, Grit, and Gratitude scales, based on Ward’s method. The y-axis shows the semantic distance between clusters, while the x-axis lists individual item embeddings, color-coded by cluster. Items from the Conscientiousness scale are labeled “C”, Gratitude scale items are labeled “G”, and the two facets of Grit, Consistency of Interest and Perseverance of Effort, are labeled “I” and “E”, respectively. With the clustering threshold set at 0.7 for the mean distance between clusters at merge, the items were divided into four distinct groups, corresponding to the original subscales. However, a closer look at the dendrogram reveals some semantic blending, particularly between Grit and Conscientiousness items. For example, the Perseverance of Effort facet of Grit first clusters with the Conscientiousness items before later merging with the Consistency of Interest facet, indicating a notable overlap between these two constructs. This merged group is then combined with the Gratitude items at a more distant clustering level.
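The merge order described above (four groups, then three, then two) can be traced by cutting one Ward tree at successive levels. The sketch below uses toy data whose centers were deliberately placed so that E lies near C, I lies farther away, and G lies farthest; it does not use the study's embeddings, and it cuts by number of clusters rather than by the 0.7 threshold mentioned in the text. The helper `composition` is hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def composition(cluster_ids, labels):
    """List, for each cluster, the source labels it contains (e.g. 'C+E')."""
    members = {}
    for cid, lab in zip(cluster_ids, labels):
        members.setdefault(int(cid), set()).add(lab)
    return sorted("+".join(sorted(group)) for group in members.values())

# Toy centers (NOT the study's data): E near C, I farther, G farthest.
centers = {"C": (0.0, 0.0), "E": (2.0, 0.0), "I": (1.0, 5.0), "G": (1.0, 20.0)}
rng = np.random.default_rng(0)
labels, rows = [], []
for lab, (x, y) in centers.items():
    for _ in range(6):
        point = np.zeros(8)
        point[:2] = (x, y)
        rows.append(point + 0.2 * rng.normal(size=8))  # small within-group noise
        labels.append(lab)

merges = linkage(np.array(rows), method="ward")

# Cut the same tree into 4, 3, and 2 clusters to follow the merge order.
levels = {
    k: composition(fcluster(merges, t=k, criterion="maxclust"), labels)
    for k in (4, 3, 2)
}
for k, groups in levels.items():
    print(k, groups)
```

Reading the cuts from fine to coarse shows which groups fuse first: in this toy setup the E items join the C items before the I items do, and the G items remain separate until the final merge.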
The step-by-step merging process is further visualized in Figure 6a–c, which depict the distribution of data points after reducing the high-dimensional embedding space to two dimensions using PCA. In Figure 6a, the data points are divided into four classes, matching the original subscales. As seen in Figure 6b, when the clustering is reduced to three classes, the Conscientiousness items merge with the Perseverance of Effort facet from the Grit scale, reflecting their semantic proximity. Finally, Figure 6c illustrates the division into two major clusters, where the combined Grit and Conscientiousness items form a single category that merges with Gratitude at a higher level of abstraction. This merging pattern highlights the structural relationships between the concepts, suggesting significant overlap between Grit and Conscientiousness, while the separation from Gratitude demonstrates ESAA’s ability to discriminate between more distinct concepts. This result showcases ESAA’s effectiveness in capturing both convergence and distinction across psychological scales, particularly in identifying concepts with overlapping semantic features.
The overlap suggested by the EGF is statistically significant. Specifically, the facet E items from the Grit scale first merge with the Conscientiousness scale items, forming cluster 2.2. This cluster (M = 0.535, SD = 0.090) has a higher mean intra-group SPIC than the Grit scale (M = 0.355, SD = 0.138), and the difference is significant (Cohen’s d = 1.382, p < 0.001).
Additionally, the discriminant validity among the (sub)clusters was confirmed. The intra-group Synthetic Pairwise Item Correlations (SPICs) for each (sub)cluster were aggregated and used as a baseline for comparison, and the results indicated significant discriminant validity for all comparisons (p < 0.001). The effect sizes for these comparisons, shown in Table 2, were substantial, with Cohen’s d values indicating large effects across all clusters.
The comparative analysis results were fully in line with expectations and are summarized in Table 3, which highlights the consistency of the various framework outputs with theoretical expectations. The robustness check confirmed that the EGF from Experiment 3 was entirely consistent with those from Experiments 1 and 2, with Gratitude items forming a distinct sub-cluster, identical to the EGF structure from Experiment 1. This consistency reinforces the stability of ESAA’s calculations.
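One way to make such a robustness check explicit is to compare co-membership of item pairs: two frameworks agree on a set of shared items if every pair grouped together in one is also grouped together in the other, regardless of how the clusters are numbered. The sketch below is a minimal illustration under that assumed criterion. The item identifiers are hypothetical, and apart from "2.2", which the text uses for the merged Conscientiousness and facet E cluster, the cluster names are illustrative.

```python
from itertools import combinations

def co_membership(assignment):
    """Set of item pairs that share a cluster in `assignment`."""
    items = sorted(assignment)
    return {
        (a, b) for a, b in combinations(items, 2)
        if assignment[a] == assignment[b]
    }

def consistent_on_shared_items(earlier, later):
    """True if `later`, restricted to the items of `earlier`, groups them identically."""
    restricted = {item: later[item] for item in earlier}
    return co_membership(restricted) == co_membership(earlier)

def ids(prefix):
    """Hypothetical item identifiers, e.g. C1..C6."""
    return [f"{prefix}{i}" for i in range(1, 7)]

# Frameworks encoded from the descriptions in the text (cluster names illustrative).
exp1 = {**dict.fromkeys(ids("C"), 1), **dict.fromkeys(ids("G"), 2)}
exp2 = {**dict.fromkeys(ids("C") + ids("E"), 1), **dict.fromkeys(ids("I"), 2)}
exp3 = {
    **dict.fromkeys(ids("C") + ids("E"), "2.2"),
    **dict.fromkeys(ids("I"), "2.1"),
    **dict.fromkeys(ids("G"), "1"),
}

print(consistent_on_shared_items(exp1, exp3))
print(consistent_on_shared_items(exp2, exp3))

# A hypothetical framework that splits the Conscientiousness items would fail the check.
split = dict(exp3, C1="other")
print(consistent_on_shared_items(exp1, split))
```

Comparing co-membership rather than raw cluster labels makes the check insensitive to arbitrary cluster numbering, which differs between separate clustering runs.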
In terms of incremental value, the frameworks generated by the ChatGPT models (4o and o1-preview) aligned less closely with theoretical expectations than the EGF did, supporting ESAA’s superiority in convergence, discrimination, and interpretability. Although the ChatGPT-generated frameworks captured the macro-structure, they did not provide consistent sub-structures, thus falling short of the precision achieved by ESAA.