1. Introduction
Psychological assessment enables the understanding and evaluation of qualities, attributes, constructs, and behavior patterns (Cherry et al., 2021; Kapur et al., 2022). Inferences about psychological constructs or attributes are drawn from participants’ responses to items that are systematically organized into standardized questionnaires. These instruments must be effective and precise to ensure accurate measurement of the constructs or attributes (Hunsley & Meyer, 2003; Loewenthal & Lewis, 2020; Meyer et al., 2001; Reynolds et al., 2021; Rönkkö & Cho, 2022; Schneider et al., 2024; Strauss & Smith, 2009; Wilson, 2023). The quality of these questionnaires depends on the rigor of the methodology used during their development (Einola & Alvesson, 2021).
Exploratory Factor Analysis (EFA) is the most commonly used statistical method for test development (Goretzko et al., 2021), while Confirmatory Factor Analysis (CFA) is widely applied to validate the evidence obtained from EFA (Marsh et al., 2020; Tavakol & Wetzel, 2020; Widaman & Helm, 2023). However, depending on the context, the development of assessment instruments across various areas of psychology often remains in its psychometric infancy. This complex issue has led to numerous proposals for adapting psychometric questionnaires to ensure rigorous measurement by demonstrating reliability and validity evidence—essential aspects for research and clinical practice (Muñiz, 2018; Shroff et al., 2023).
The adaptation process begins with translation or linguistic adaptation, but this is merely the starting point of an exhaustive procedure. Content adaptation must accurately reflect the cultural and contextual reality of the target group, ensuring that the items are understandable and culturally appropriate for the new population (Abad et al., 2011; Muñiz et al., 2013).
The objective of this tutorial is to provide a guide for the adaptation and validation of psychometric instruments in psychological assessment, focusing on evidence of validity and reliability. The article outlines procedures and criteria for calculating reliability, evaluating validity evidence using current psychometric standards, and guiding researchers in ensuring that the instruments they use to evaluate psychological attributes and constructs are accurate and reliable (Elosua & Egaña, 2020; Haladyna & Rodriguez, 2013).
Adaptation of the Psychological Questionnaires
Test adaptation has advanced significantly over the past 25 years. This procedure, applied across various fields, follows several guidelines, such as those proposed by the International Test Commission (2017). These guidelines, outlined in 18 directives, represent a comprehensive effort to ensure a rigorous test adaptation process.
The first section, preconditions, involves researchers making initial decisions prior to the translation and adaptation process. The second section, test development, focuses on adapting or translating the instrument. The third phase, confirmation, involves analyzing equivalence, reliability, and validity evidence based on empirical data collected from the new context to which the questionnaire is being adapted. Subsequent phases include administration, scoring scales and interpretation, and documentation (International Test Commission, 2017).
This adaptation process is a complex effort that goes beyond mere item translation. Strict adherence to these guidelines is essential to ensure the subsequent validation analyses of the questionnaire within the target population.
Confirmatory Factor Analysis
Confirmatory Factor Analysis (CFA) is conducted after a theoretical structure for the test has been proposed. CFA evaluates whether the empirical data fit the predefined factorial structure and confirms if the proposed model aligns with the observed data (Luo et al., 2019). Through CFA, expected factor loadings, factor correlations, and other model parameters can be verified.
This analysis is crucial for validating the proposed factorial structure and ensuring that the test accurately and reliably measures the intended constructs (Muñiz, 2018).
Confirmatory Factor Analysis (CFA) is a widely used tool for questionnaire validation (Marsh et al., 2014; Montoya & Edwards, 2021; Taasoobshirazi & Wang, 2016). This analysis assesses model fit using various indices, including the Comparative Fit Index (CFI), Tucker-Lewis Index (TLI), Root Mean Square Error of Approximation (RMSEA), and Standardized Root Mean Square Residual (SRMR).
Each index provides insight into a different aspect of model fit. The recommended thresholds for a well-fitting model are typically CFI ≥ 0.90, TLI ≥ 0.90, RMSEA ≤ 0.06, and SRMR < 0.08 (Hu & Bentler, 1999; Tucker & Lewis, 1973; Browne & Cudeck, 1993) (see Table 1 and Supplementary Table 1).
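As a practical illustration, the following R sketch shows how these indices can be obtained with the lavaan package (one of the programs discussed later in this tutorial). The three-factor model and the HolzingerSwineford1939 dataset bundled with lavaan are purely illustrative and do not correspond to any instrument discussed in this article.

library(lavaan)

# Illustrative three-factor CFA on the example dataset shipped with lavaan
model <- ' visual  =~ x1 + x2 + x3
           textual =~ x4 + x5 + x6
           speed   =~ x7 + x8 + x9 '

fit <- cfa(model, data = HolzingerSwineford1939)

# Extract the fit indices discussed in this section
fitMeasures(fit, c("cfi", "tli", "rmsea", "srmr"))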
Relative fit index
Relative (or incremental) fit indices compare the fit of the proposed model to that of a null model, a baseline model in which all observed variables are assumed to be uncorrelated (Hutchinson & Olmos, 1998). They evaluate how much the proposed model improves on that reference model, balancing the simplicity of the model with its explanatory power.
These indices typically range from 0 to 1, with values close to 1 indicating a good model fit, while values below 0.90 generally suggest that the model may not fit the data adequately. Several of them also account for the model's degrees of freedom, penalizing unnecessarily complex models, and are therefore useful for judging whether a simpler model might be preferable to one that fits well only at the cost of disproportionate complexity. The relative fit indices described below are the CFI, TLI, NFI, and RFI.
Comparative Fit Index (CFI)
The Comparative Fit Index (CFI) is a widely used metric in structural equation modeling to evaluate the quality of the proposed model's fit. The CFI compares the evaluated model to a null model, which assumes no relationships between variables. This index adjusts for differences in sample size and model complexity, providing a measure of how much the proposed model improves fit compared to the null model (Marsh et al., 2014).
A CFI value of ≥ 0.90 is considered indicative of good model fit. This cutoff suggests that the proposed model significantly improves the fit compared to the null model, though it may not be optimal. Values above 0.95 indicate excellent fit, suggesting that the proposed model aligns very well with the observed data and offers a robust representation of the relationships among variables.
These recommendations are based on the work of Hu and Bentler (1999), who established these thresholds to guide researchers in interpreting CFI in the context of model evaluation. The formula is presented below.
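In its standard formulation, with χ²_m and df_m denoting the chi-square statistic and degrees of freedom of the proposed model and χ²_0 and df_0 those of the null model (notation adopted here for the remaining fit indices as well), the CFI is computed as:

\mathrm{CFI} = 1 - \frac{\max(\chi^2_m - df_m,\, 0)}{\max(\chi^2_0 - df_0,\ \chi^2_m - df_m,\, 0)}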
Tucker-Lewis Index (TLI)
The Tucker-Lewis Index (TLI), also known as the Non-Normed Fit Index (NNFI), is a key metric in Confirmatory Factor Analysis (CFA). Like the Comparative Fit Index (CFI), the TLI assesses the fit of the proposed model by comparing it to a null model. However, the TLI has the distinct feature of penalizing overly complex models, making it particularly useful for promoting parsimony in model evaluation (Cai et al., 2023; Shi et al., 2019).
A TLI value of ≥ 0.90 is considered acceptable, indicating that the proposed model provides an adequate fit relative to the null model, adjusted for model complexity. Higher values suggest better fit, with scores above 0.95 generally interpreted as indicative of excellent fit. This penalty for complexity helps avoid overfitting and encourages models that are both parsimonious and representative of the data. The formula is presented below.
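Using the same notation as for the CFI, the TLI is typically computed as:

\mathrm{TLI} = \frac{\chi^2_0/df_0 - \chi^2_m/df_m}{\chi^2_0/df_0 - 1}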
Normed Fit Index (NFI)
The Normed Fit Index (NFI) evaluates the fit of a proposed model in comparison to a null model, where all variables are assumed to be unrelated. The NFI value ranges from 0 to 1, with values closer to 1 indicating a better fit. A value of 0.90 or higher is generally accepted as indicative of good fit.
The NFI directly compares the proposed model to the null model and can be sensitive to sample size, particularly in studies with small samples (Goretzko et al., 2024; Zheng & Bentler, 2024). The formula is presented below.
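In its standard form, the NFI is computed as:

\mathrm{NFI} = \frac{\chi^2_0 - \chi^2_m}{\chi^2_0}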
Relative Fit Index (RFI)
The Relative Fit Index (RFI) compares the fit of a model with a null model, similar to the NFI, but takes into account the complexity of the model and the covariance structure between variables. The RFI is interpreted on a scale from 0 to 1, where values closer to 1 indicate a better relative fit of the proposed model compared to the null model. This index can be useful for avoiding overfitting and considering more parsimonious models (Van Laar & Braeken, 2022). The formula is presented below.
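The RFI (Bollen's relative fit index) is commonly written as:

\mathrm{RFI} = \frac{\chi^2_0/df_0 - \chi^2_m/df_m}{\chi^2_0/df_0}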
Absolute fit index
The Absolute Fit Index refers to a category of indices that directly evaluate how well a proposed model fits the observed data, without making comparisons to other models, as relative fit indices would. Absolute fit indices measure the quality of the model’s fit based on the discrepancy between the observed covariance matrix and the covariance matrix estimated by the model. These indices are essential because they provide a direct assessment of the fit without involving model complexity or comparison with other models (Goretzko et al., 2024).
Goodness of fit index (GFI)
The Goodness of Fit Index (GFI) is an absolute fit measure that evaluates the proportion of variance and covariance explained by the proposed model in relation to the observed data. It is similar to a coefficient of determination (R²) in regression, and its value ranges from 0 to 1, where values close to 1 indicate a good model fit. A value of 0.90 or higher is generally considered acceptable. The GFI is sensitive to sample size, and in studies with small samples, it may underestimate the model fit (Wang et al., 2020). The formula is presented below.
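For maximum likelihood estimation, the GFI is commonly expressed in terms of the sample covariance matrix S and the model-implied covariance matrix Σ̂:

\mathrm{GFI} = 1 - \frac{\operatorname{tr}\left[(\hat{\Sigma}^{-1} S - I)^2\right]}{\operatorname{tr}\left[(\hat{\Sigma}^{-1} S)^2\right]}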
Adjusted goodness of fit index (AGFI)
It is an adjusted version of the GFI that takes into account the complexity of the model, penalizing models that include more parameters than necessary. The AGFI adjusts the GFI value based on the model's degrees of freedom, so models with more parameters (more complex) receive a penalty. Like the GFI, the AGFI also ranges from 0 to 1, with values close to 0.90 considered indicative of a good fit. The AGFI is more useful when there is a need to balance model fit with its parsimony, that is, achieving a good fit with the fewest possible parameters (Maassen et al., 2023). The formula is presented below.
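With p denoting the number of observed variables and df_m the degrees of freedom of the proposed model, the AGFI is typically computed as:

\mathrm{AGFI} = 1 - \frac{p(p+1)}{2\,df_m}\,(1 - \mathrm{GFI})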
Root Mean Square Error of Approximation (RMSEA)
The Root Mean Square Error of Approximation (RMSEA) is an index used to evaluate the discrepancy between the proposed model and the observed data in confirmatory factor analysis (CFA). Unlike other indices that assess model fit in comparative terms, RMSEA measures the absolute fit error and provides an estimate of how well the model fits the data in terms of average error per degree of freedom (Browne & Cudeck, 1993; Hu & Bentler, 1995).
An RMSEA value ≤ 0.06 is considered indicative of a good model fit. This threshold suggests that the discrepancy between the model and the data is small, implying that the model adequately represents the relationships among the observed variables. RMSEA values less than 0.08 are generally acceptable, indicating that the model has a reasonably good fit, though not as precise as lower values. These criteria help researchers interpret the quality of model fit and decide whether the proposed model is suitable for the data. The formula is presented below.
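With N denoting the sample size, the RMSEA is typically computed as:

\mathrm{RMSEA} = \sqrt{\frac{\max(\chi^2_m - df_m,\, 0)}{df_m\,(N - 1)}}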
Standardized Root Mean Square Residual (SRMR)
The Standardized Root Mean Square Residual (SRMR) is an index that evaluates the discrepancies between the observed covariances and the covariances predicted by the model in confirmatory factor analysis (CFA). Unlike other indices that compare model fit with a null model or adjust for model complexity, SRMR provides a direct measure of absolute fit by quantifying the average standardized difference between the observed and estimated covariances (Hu & Bentler, 1999).
An SRMR value < 0.08 is considered indicative of a good fit. This threshold suggests that the model has a relatively low discrepancy between the observed and predicted covariances, indicating that the model fits the data well. Higher values may suggest issues with model fit, and it may be necessary to review and adjust the model to improve its ability to represent the relationships in the data. The formula is presented below.
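With s_ij denoting the observed covariances, σ̂_ij the model-implied covariances, and p the number of observed variables, the SRMR is typically computed as:

\mathrm{SRMR} = \sqrt{\frac{\sum_{i \le j}\left[(s_{ij} - \hat{\sigma}_{ij})/(s_{ii}^{1/2}\, s_{jj}^{1/2})\right]^2}{p(p+1)/2}}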
Parsimony fit index
Parsimony Fit Indices focus on evaluating the balance between a model’s fit and its simplicity. A parsimonious model is one that achieves a good fit with the least number of parameters, avoiding overfitting by including unnecessary parameters. These indices penalize models that, while they may fit the data well, are overly complex.
Parsimony Goodness of Fit Index (PGFI)
The Parsimony Goodness of Fit Index (PGFI) is an extension of the GFI that adjusts for the parsimony of the model. This index incorporates a penalty for model complexity, so that a model with more parameters, even if it fits well, will receive a lower score if it is unnecessarily complex. PGFI values are generally lower than GFI values. A PGFI value ≥ 0.50 is considered indicative of a good balance between fit and parsimony (Thakkar, 2020). The formula is presented below.
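With p denoting the number of observed variables, the PGFI is commonly computed as:

\mathrm{PGFI} = \frac{df_m}{p(p+1)/2}\,\mathrm{GFI}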
Parsimony Normed Fit Index (PNFI)
The Parsimony Normed Fit Index (PNFI) is an adjusted version of the NFI that penalizes model complexity. The PNFI introduces a correction that takes into account the degrees of freedom of the model, encouraging simpler models that achieve a good fit. Like the NFI, PNFI values vary between 0 and 1. A PNFI value ≥ 0.50 is considered acceptable and reflects a good fit with parsimony (Marsh et al., 2005). The formula is presented below.
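Using the same notation, the PNFI is commonly computed as:

\mathrm{PNFI} = \frac{df_m}{df_0}\,\mathrm{NFI}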
Parsimony Comparative Fit Index (PCFI)
Parsimony Comparative Fit Index (PCFI) is an adjusted version of the CFI, which also penalizes model complexity. Like the PNFI, the PCFI considers the degrees of freedom of the model and favors those that achieve a good fit with fewer parameters. A PCFI value ≥ 0.50 is considered adequate and reflects a good balance between model fit and simplicity (Maassen et al., 2023). The formula is presented below.
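Analogously, the PCFI is commonly computed as:

\mathrm{PCFI} = \frac{df_m}{df_0}\,\mathrm{CFI}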
Likelihood-ratio χ²/degrees of freedom (χ²/df)
The ratio of the chi-square statistic to the model's degrees of freedom (χ²/df) is an index commonly used to evaluate the parsimony of the model. It is calculated by dividing the value of the chi-square test by the degrees of freedom of the model. A lower value indicates a more parsimonious model. Generally, a χ²/df value below 3 is considered indicative of good fit and parsimony, although some authors accept values up to 5 depending on the context (Van Laar & Braeken, 2022).
Reliability Indicators
The reliability of a test refers to the consistency of scores obtained on different occasions or under different conditions. A reliable test produces stable and consistent results, meaning that scores should be similar if the test is administered again under equivalent conditions (Kalkbrenner, 2024; Viladrich et al., 2017; Zijlmans et al., 2018; Zimmerman, 2007). The main reliability indicators used to assess this consistency are described below (see Table 2).
Cronbach's Alpha (α)
It is one of the most commonly used indexes to measure the internal consistency of a test. Cronbach's alpha evaluates the average correlation between the items of a test and provides a measure of the homogeneity of the items. A value of α ≥ 0.70 is considered acceptable, although higher values (≥ 0.80 or ≥ 0.90) are preferable, indicating higher internal consistency (American Psychological Association, 2022; Doval et al., 2023; McNeish, 2017; Viladrich et al., 2017). Cronbach's alpha is calculated using the following formula:
\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k}\sigma_i^2}{\sigma_t^2}\right)

where k is the number of items in the test, σ_i^2 is the variance of item i, and σ_t^2 is the total variance of the test. The ordinal Cronbach's alpha, based on the polychoric correlation matrix, is calculated using the following formula:

\alpha_{ordinal} = \frac{k\,\bar{r}_{pc}}{1 + (k-1)\,\bar{r}_{pc}}

where k is the number of items and r̄_pc is the average polychoric correlation between items.
Congeneric Reliability Coefficient (ρ)
The coefficient is a generalization of Cronbach's alpha coefficient and is used when items are not tau-equivalent (i.e., they do not have the same variance or the same relationship to the construct). The formula is as follows:
\rho = \frac{\left(\sum_{i=1}^{k}\lambda_i\right)^2}{\left(\sum_{i=1}^{k}\lambda_i\right)^2 + \sum_{i=1}^{k}\theta_i}

where λ_i are the factor loadings of the items and θ_i are the variances of the errors.
McDonald's Omega coefficient (ω)
This index is an alternative to Cronbach's alpha and provides a more accurate estimate of reliability when items are assumed to measure several underlying factors. The omega coefficient is based on factor analysis models and takes into account the dimensional structure of the test, providing a more robust measure of internal consistency. Values of ω ≥ 0.70 indicate good reliability (McDonald, 1999; Viladrich et al., 2017). The omega coefficient is calculated using the following formula:
\omega = \frac{\left(\sum_{i=1}^{k}\lambda_i\right)^2}{\left(\sum_{i=1}^{k}\lambda_i\right)^2 + \sum_{i=1}^{k}\theta_i}

where λ_i are the standardized factor loadings of the items (λ_i^2 being the variance explained by item i through its squared factor loading) and θ_i is the residual variance of item i. The ordinal omega coefficient is calculated with the same expression, using loadings and error variances estimated from the polychoric correlation matrix:

\omega_{ordinal} = \frac{\left(\sum_{i=1}^{k}\lambda_i\right)^2}{\left(\sum_{i=1}^{k}\lambda_i\right)^2 + \sum_{i=1}^{k}\theta_i}

where λ_i are the standardized factor loadings of the items on the factor, θ_i are the error variances of the items, and k is the total number of items.
Hierarchical Omega Reliability Coefficient (ω_h)
The Hierarchical Omega reliability coefficient (ω_h) is used in hierarchical factor models. This coefficient assesses the internal consistency of a test when items are grouped into subscales or first-order factors. Unlike the alpha coefficient, which assumes that all items are tau-equivalent, the Hierarchical Omega allows a more accurate assessment in contexts where items have different factor loadings. Its formula is:

\omega_h = \frac{\left(\sum_{i=1}^{k}\lambda_{g,i}\right)^2}{\sigma_X^2}

where λ_g,i are the factor loadings of the items on the overall (general) factor and σ_X^2 is the total variance of the test, which includes the variances of the item errors.
Guttman's Lambda (λ)
Guttman's lambda coefficients provide lower-bound estimates of a test's reliability. Unlike Cronbach's alpha, which assumes a unidimensional, tau-equivalent model, Guttman's lambdas can be adapted to more complex models. Values of λ ≥ 0.70 suggest good reliability (Benton, 2015; Guttman, 1945; Motallebzadeh, 2023). The basic coefficient of the family, Guttman's λ1, is calculated with the formula:
\lambda_1 = 1 - \frac{\sum_{i=1}^{k}\sigma_i^2}{\sigma_t^2}

where σ_i^2 is the variance of item i and σ_t^2 is the total variance of the test.
Each of these indicators offers a different perspective on the consistency of test scores and may be useful in different contexts and for different types of instruments. Using a combination of these indices provides a more complete and robust assessment of test reliability.
Evidence of Validity
Validity refers to how well a test measures what it purports to measure. The main validity indicators include the average variance extracted (AVE) and the heterotrait-monotrait ratio (HTMT) (Salessi & Omar, 2019).
Convergent Validity: Average Variance Extracted (AVE)
Average Variance Extracted (AVE) is an index used to assess the convergent validity of a construct in a measurement model. AVE quantifies the amount of variance in the items that is explained by the construct in question. A high AVE value indicates that a large proportion of the variance in the items is attributable to the construct they are intended to measure, suggesting that the construct is being measured effectively and accurately (Fornell & Larcker, 1981; Rönkkö & Cho, 2022). The formula for calculating the AVE is as follows:

\mathrm{AVE} = \frac{\sum_{i=1}^{n}\lambda_i^2}{\sum_{i=1}^{n}\lambda_i^2 + \sum_{i=1}^{n}\theta_i}

where λ_i^2 is the square of the loading of item i on the construct, θ_i is the variance of the error of item i, and n is the total number of items in the construct. When standardized loadings are used, this expression reduces to the average of the squared loadings, Σλ_i^2 / n.
For a construct to show adequate convergent validity, it must explain at least 50% of the variance of its component items (Fornell & Larcker, 1981); therefore, the cut-off point for the AVE is ≥ 0.50. Because some constructs have difficulty capturing that much variance, it has been proposed to admit constructs that explain at least 37.5% of the variance, placing the lax cut-off point at ≥ 0.375 (Moral de la Rubia, 2019).
Discriminant Validity: Heterotrait-Monotrait Ratio (HTMT)
The Heterotrait-Monotrait Ratio (HTMT) is a measure used to assess the discriminant validity between constructs in a measurement model. The HTMT examines the relationship between items of different constructs (heterotrait) compared to the relationship between items within the same construct (monotrait). In other words, the HTMT assesses the difference in correlation between items measuring different constructs versus items measuring the same construct.
An HTMT value of less than 0.85 suggests that the constructs are sufficiently differentiated from each other, indicating good discriminant validity. If the HTMT value is greater than 0.85, it could indicate a lack of differentiation between the constructs, suggesting that the constructs may be overlapping (Henseler et al., 2015). The formula for calculating the HTMT is:
\mathrm{HTMT}_{ij} = \frac{\bar{r}_{hetero,ij}}{\sqrt{\bar{r}_{mono,i}\;\bar{r}_{mono,j}}}

where the mean of the heterotrait correlations, r̄_hetero,ij, is the mean of the correlations between items of the different constructs i and j, and the means of the monotrait correlations, r̄_mono,i and r̄_mono,j, are the means of the correlations between items of the same construct. An HTMT value below the 0.85 cut-off point indicates good discriminant validity, suggesting that the constructs are sufficiently differentiated from each other (Henseler et al., 2015).
Measurement Invariance
Measurement invariance refers to the consistency in the factor structure of an instrument when applied to different groups (Van De Schoot et al., 2015). This aspect ensures that the number and type of factors, as well as the relationships between items and factors, are equivalent in the groups compared. To assess measurement invariance, confirmatory factor analysis (CFA) is performed on each group to confirm that the items load on the same factors in all samples (Marsh et al., 2014). This type of invariance is crucial to ensure that the instrument measures the same construct in different populations (Maassen et al., 2023; Schmitt & Kuljanin, 2008).
The classic levels of factorial invariance have been widely used to assess the equivalence of measurements across groups. Configural invariance tests whether the basic factorial structure, i.e., the pattern of relationships between items and factors, remains consistent across groups without imposing additional parameter constraints. Metric invariance then evaluates whether factor loadings are equivalent across groups, ensuring that items consistently reflect their relationship with the latent factors. At a more advanced level, scalar invariance examines the equality of intercepts, guaranteeing that scores reflect differences in the latent factors rather than measurement biases. Finally, strict invariance, the most rigorous stage, assesses whether measurement errors are homogeneous across groups, enabling precise comparisons that are not contaminated by measurement noise.
In recent research, new forms of invariance were explored to address the limitations of traditional approaches. Rutkowski and Svetina (2017) proposed invariance in hierarchical models, which allowed the evaluation of factorial equivalence in nested data, such as students within schools. They also introduced partial invariance as a solution for scenarios where full invariance was unattainable, permitting some parameters to vary across groups without compromising overall comparability. On the other hand, Xu and Soland (2024) investigated predictive invariance, which assessed whether relationships between latent factors and external variables were consistent across groups, thus integrating predictive validity into the analysis. Approximate invariance was also adopted, tolerating minor deviations in parameters and offering greater flexibility in culturally or linguistically diverse contexts.
Loading Invariance
Loading invariance examines whether the factor loadings of items are equal across different groups. This type of invariance verifies that each item has the same impact on the underlying factor, regardless of the group to which the test is applied (Kim & Yoon, 2011; Rutkowski & Svetina, 2017; Xu & Soland, 2024). To assess this invariance, we impose restrictions on the factor loadings in the model and compare the fit of the restricted model with an unrestricted model. The loading invariance ensures that the items have an equivalent influence on the factors across the samples studied (Kim et al., 2017). The formula is presented below.
X_i = \lambda_{ij}\,\xi_j + \varepsilon_i

where X_i is the response observed on item i, λ_ij is the factor loading of item i on factor j, ξ_j is the latent value of factor j, and ε_i is the residual error of item i. The loading invariance criterion, as part of the stepwise invariance testing procedure, ensures that the relationships between items and factors are equivalent across groups. It is assessed after establishing a configural model, which verifies that the same factor structure holds in every group without equality restrictions (CFI > 0.90, RMSEA < 0.08, SRMR < 0.08). In the metric model, the factor loadings (λ) are constrained to be equal between groups, and invariance is considered achieved if the change in fit indices is acceptable (ΔCFI ≤ 0.01, ΔRMSEA ≤ 0.015, ΔSRMR ≤ 0.030). This step is crucial to validate the comparability of item-factor relationships before moving on to stricter forms of invariance such as scalar invariance (which adds intercepts) and strict invariance (which adds residuals).
Variance Inflation Factor (VIF)
VIF is a measure used in regression analysis to detect multicollinearity between predictor variables. It indicates how much the variance of a regression coefficient increases due to correlation between independent variables. The formula is presented below.
\mathrm{VIF}_i = \frac{1}{1 - R_i^2}

where R_i^2 is the coefficient of determination of the regression of predictor variable i on the other predictor variables.
A VIF close to 1 indicates no multicollinearity, meaning the independent variable is not highly correlated with the others. This suggests that the regression coefficients can be estimated accurately and without significant distortion.
On the other hand, a VIF greater than 5 or 10 signals the presence of problematic multicollinearity, implying that a variable is strongly correlated with others in the model. In such cases, coefficient estimation becomes unstable, increasing variance and reducing the reliability of results. To mitigate this issue, techniques such as removing redundant variables, combining correlated variables, or using regularization methods like ridge regression can be applied.
Intercept Invariance
Intercept invariance focuses on whether the starting points of the items in the measurement scale are the same in different groups. That is, it assesses whether the relationship between item and factor scores is consistent across groups. To assess this invariance, restrictions are imposed on the intercepts of the items and the fitted model is compared with one that does not have these restrictions. Intercept invariance is essential to ensure that differences in scores between groups are not due to biases in item starting points (De Roover, 2021). The formula is presented below.
X_i = \tau_i + \lambda_{ij}\,\xi_j + \varepsilon_i

where X_i is the response observed on item i, λ_ij is the factor loading of item i on factor j, ξ_j is the latent value of factor j, τ_i is the intercept of item i, and ε_i is the residual error of item i.
The intercept (scalar) invariance criterion is assessed by comparing the metric model, which constrains the factor loadings (λ), with a scalar model that also constrains the intercepts (τ) to be equal between groups. It is considered achieved if the change in the overall fit indices is acceptable (ΔCFI ≤ 0.01, ΔRMSEA ≤ 0.015, ΔSRMR ≤ 0.010). The process involves fitting both models and assessing the comparability of the fit indices to ensure that differences between groups reflect only variations in the latent scores and not in the items themselves.
Residual Invariance
Residual invariance determines whether the residual variances (measurement errors) are equal across groups. This type of invariance ensures that the amount of variance in the items that is not explained by the factor is consistent across groups. To assess residual invariance, the model is adjusted to equalise measurement errors across groups and compared to models that do not have these restrictions. Residual invariance is crucial to ensure that differences in results are not due to differences in residual variability between groups (Putnick & Bornstein, 2016). The formula is presented below.
X_i = \tau_i + \lambda_{ij}\,\xi_j + \varepsilon_i

where X_i is the response observed on item i, λ_ij is the factor loading of item i on factor j, ξ_j is the latent value of factor j, τ_i is the intercept of item i, and ε_i is the residual error of item i, whose variance, Var(ε_i), is constrained to be equal across groups in this step.
The criteria for assessing residual invariance include the analysis of the global fit indices, where an acceptable change is ΔCFI ≤ 0.01, ΔRMSEA ≤ 0.015, and ΔSRMR ≤ 0.010. For the model comparison, the scalar model, which restricts factor loadings and intercepts, is evaluated against the strict model, which further restricts the variances of the residual errors. If the changes in the fit indices are minimal, it is concluded that residual invariance is achieved, indicating that the measurements are fully equivalent between the groups.
Temporal Invariance
Time invariance refers to the stability of the instrument's properties over time. A test should maintain the same structure and psychometric properties when administered at different points in time. This ensures that changes in observed scores over time reflect changes in the measured construct and not variations in the properties of the instrument (Li et al., 2018). The formula for time invariance can be expressed similarly to factorial invariance, but in a longitudinal context, considering the different points in time t1, t2, …, tn at which the same constructs are measured.
X_{it} = \tau_{it} + \lambda_{ijt}\,\xi_{jt} + \varepsilon_{it}

where X_it is the observed score on item i at time t, λ_ijt is the factor loading of item i on factor j at time t, ξ_jt is the latent factor at time t, τ_it is the intercept of item i at time t, and ε_it is the residual error at time t.
Criteria for assessing time invariance include the analysis of global fit indices, where acceptable changes are ΔCFI ≤ 0.01, ΔRMSEA ≤ 0.015, and ΔSRMR ≤ 0.010. For model comparison, an initial temporal model, which may have minor restrictions, is compared with more restrictive models that impose equality of factor loadings, intercepts, and residual errors over time. If the changes in the fit indices are minimal, it is concluded that time invariance has been achieved, indicating that the measurements are equivalent over time.
Structural invariance
Structural invariance is essential to ensure that the latent relationships between the underlying constructs, represented by the factors, are consistent across groups. When the correlations between factors differ across groups, structural invariance does not hold, indicating that the relationships between factors vary significantly across the groups analysed (Sass & Schmitt, 2013). When this condition is not met, the underlying associations between the assessed dimensions or latent variables vary across groups, which may compromise the validity of intergroup comparisons. In a robust analysis, structural invariance is required to assert that factors are comparable in terms of their relationships and magnitudes across different populations or subgroups (Kang et al., 2016; Rogers, 2024). The formula for structural invariance is expressed through structural equation modelling, similar to factorial invariance, but focusing on the relationships between latent factors (structural paths) rather than only on loadings or intercepts.

\eta_2 = \beta\,\eta_1 + \zeta

where η_1 and η_2 are the latent factors, β is the regression coefficient representing the structural relationship between them, and ζ is the specification error.
Model Interpretation
Differences in correlations between factors could reflect that different groups interpret or respond to the items that make up the scale in distinct ways (Loewenthal & Lewis, 2020; Reise et al., 1993; Streiner et al., 2024). This may signal that the factors not only have different meanings for each group but also that their relevance or weight within the model varies. For example, in the context of mental health, two factors that are highly correlated in one population may not show the same correlation in another, indicating differences in how psychological constructs manifest or interact within those groups. Such discrepancies could suggest that the underlying psychological or conceptual structure of the factors differs between groups, potentially influenced by variables such as cultural context, personal experiences, or socioeconomic differences.
Modeling and Comparisons
When differences in correlations between factors are observed, making direct comparisons of factor scores between different groups can be problematic and potentially misleading (Christensen & Golino, 2021). If the correlations between factors are not homogeneous, it indicates that the underlying factor model is not structurally invariant. In such a case, this means that comparisons between groups may not be valid, as the relationships between the constructs are not equivalent. Such discrepancies can lead to misinterpretations of the observed differences in factor scores, as they may reflect structural differences rather than true differences in the measured characteristics. Furthermore, the lack of structural invariance complicates the ability to conduct more advanced analyses, such as multi-group or causal path analyses, as the interpretations may be biased (Morin et al., 2020).
The existence of significant differences in correlations between factors suggests the need for a critical reevaluation of the proposed model. One option could be to explore alternative models that allow for a more flexible structure adapted to the observed differences between groups (Leitgöb et al., 2023). It may also be necessary to conduct further research to identify the underlying reasons for these differences in correlations, which could include a deeper analysis of the characteristics of the subgroups and their contexts. In some cases, it might be useful to adjust the model to reflect the lack of structural invariance, which could involve introducing specific parameters for each group or identifying subgroups that exhibit particular correlation patterns. These modifications would not only improve the model's accuracy but also provide a better understanding of the dynamics influencing different groups, facilitating a more appropriate and contextually relevant interpretation of the results.
Methods for Evaluating Invariance
To evaluate invariance, advanced statistical techniques such as Confirmatory Factor Analysis (CFA) (Kenny, 1976; Marsh et al., 2014; Rogers, 2024) and Structural Equation Modeling (SEM) (Jorgensen et al., 2020; Kline, 2011; Oberski & Satorra, 2013; Rosseel, 2012; 2020; Vispoel et al., 2024) are used. In the context of CFA, different hierarchical invariance models (configural, metric, scalar, and residual) are compared to determine if there are significant differences between groups. The process includes configuring the model, assessing metric invariance, scalar invariance, and residual invariance, thereby ensuring that the instrument is valid and reliable across different contexts and time points.
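As a hedged sketch of this sequence of nested models, the following R code with lavaan fits configural, metric, scalar, and strict models and compares their fit; the model, the HolzingerSwineford1939 dataset, and the grouping variable "school" are illustrative stand-ins for the researcher's own instrument and groups.

library(lavaan)

# Illustrative model and data (bundled with lavaan)
model <- ' visual  =~ x1 + x2 + x3
           textual =~ x4 + x5 + x6
           speed   =~ x7 + x8 + x9 '

# Nested multigroup models: configural, metric, scalar, strict
fit_config <- cfa(model, data = HolzingerSwineford1939, group = "school")
fit_metric <- cfa(model, data = HolzingerSwineford1939, group = "school",
                  group.equal = "loadings")
fit_scalar <- cfa(model, data = HolzingerSwineford1939, group = "school",
                  group.equal = c("loadings", "intercepts"))
fit_strict <- cfa(model, data = HolzingerSwineford1939, group = "school",
                  group.equal = c("loadings", "intercepts", "residuals"))

# Chi-square difference tests between successive models
lavTestLRT(fit_config, fit_metric, fit_scalar, fit_strict)

# Changes in CFI, RMSEA and SRMR can then be compared against the delta cut-offs above
sapply(list(configural = fit_config, metric = fit_metric,
            scalar = fit_scalar, strict = fit_strict),
       fitMeasures, fit.measures = c("cfi", "rmsea", "srmr"))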
Psychometric validation is essential to guarantee the accuracy and relevance of assessment instruments in psychology. This process must begin with translation and cultural adaptation to ensure the instrument fits the target population as closely as possible. Psychometric property validation assesses the reliability and validity of the instrument, as well as the model fit, to achieve the effective application of psychometric tools in psychological research.
Conclusions
This article has reviewed the main theoretical and methodological aspects of adapting and validating psychometric instruments in the field of psychology, emphasizing the importance of rigorous evaluation at each phase of the process. The use of exploratory and confirmatory factor analysis, along with internal consistency and validity indicators such as Cronbach's alpha, omega, AVE, HTMT, CFI, TLI, RMSEA, and SRMR, is crucial to ensure that a test is not only reliable but also valid. Simultaneously, considering these indicators provides a comprehensive view of the instrument's structure and quality, allowing for the identification of possible areas for improvement and ensuring accurate assessment.
It is assumed that the correct linguistic and cultural adaptation of the instrument has been carried out to ensure that the items are relevant and understandable in the new cultural context. This process involves not only accurate translation but also the adaptation of the assessed concepts to reflect the realities and experiences of the target cultural group, minimizing the risk of biases or misinterpretations. In this way, the process of cross-cultural validation, along with psychometric validation, will enhance the utility and applicability of the instruments in diverse populations, promoting equity in psychological assessment.
In summary, applied psychometrics enables psychological instruments to be relevant and accurate for assessing different constructs in diverse populations. By ensuring the reliability and validity of measurements, the quality of the collected data is assured, which in turn strengthens the validity of the results obtained, both in research and clinical evaluation. This methodological approach is essential for advancing evidence-based psychology, where measurement accuracy is key for making well-founded clinical or educational decisions and for developing public policies that respond to the real needs of the populations assessed.
Recommendations
The use of multiple psychometric indicators is essential for a comprehensive validation of psychological assessment instruments. Reliability and validity are complementary aspects that should be addressed with a multidimensional approach. The use of Cronbach's alpha and omega as reliability indicators provides an overview of a test's internal consistency, while validity should be evaluated with measures such as AVE (Average Variance Extracted) and HTMT (Heterotrait-Monotrait Ratio), which assess convergent and discriminant validity. Furthermore, fit indices such as CFI (Comparative Fit Index), TLI (Tucker-Lewis Index), RMSEA (Root Mean Square Error of Approximation), and SRMR (Standardized Root Mean Square Residual) are crucial to evaluate whether the proposed model fits the data appropriately. By employing these multiple indicators, a more complete and robust picture of the psychometric properties of the instrument is obtained, avoiding the limitations of relying solely on one measure.
Confirmatory Factor Analysis (CFA) is an indispensable tool in the psychometric validation of instruments. While Exploratory Factor Analysis (EFA) is useful in the early stages to identify possible underlying structures, CFA allows testing more specific hypotheses about the factorial structure of the data. The recommendation to use both approaches provides stronger validation, as EFA allows for open exploration of factors, while CFA validates the structure found with additional data or different samples. Statistical programs such as SPSS, R, Mplus, JASP, or Jamovi offer robust tools that enable modeling complex data and conducting the necessary tests to ensure factorial validity.
It is essential to consider the recommended values for model fit indicator cut-offs. For CFI and TLI, values above 0.90 are considered indicative of a good fit, although ideally, they should exceed 0.95. For RMSEA, values below 0.08 indicate an acceptable fit, while values below 0.05 reflect an excellent fit. For SRMR, a value below 0.08 is interpreted as a good fit. Ensuring that models meet these criteria is crucial for rigorously validating the structure of instruments, guaranteeing that the proposed theoretical model fits the sample data optimally.
It is recommended to consider the International Test Commission (ITC) when adapting and using psychological tests in different cultural and linguistic contexts. The ITC promotes the ethical and scientific use of assessments globally, ensuring that tests maintain their validity and reliability. The organization has developed guidelines that address key issues such as cultural equivalence, ensuring that assessments are fair and appropriate for the populations they are applied to. For professionals working with international tests, following the ITC's recommendations is essential.
It is also advisable to follow the guidelines of the American Psychological Association (APA), a globally influential organization in psychology and psychometrics. The APA provides ethical and scientific standards that are used by psychologists worldwide, not just in the United States. Its committees and divisions offer guidelines on research and test adaptation, helping ensure that assessment instruments are culturally sensitive and valid. The APA guidelines are widely used in the adaptation of psychological and educational tests.
It is also recommended to consider the standards promoted by the European Federation of Psychologists' Associations (EFPA). This organization groups psychology associations in Europe and works on harmonizing psychological assessment standards in the region. The EFPA's psychological assessment committee focuses on the proper adaptation of tests in different European countries, promoting the development of common guidelines to ensure the validity and reliability of assessments in various cultural contexts. For professionals working in Europe or with European populations, following the EFPA's recommendations is essential.
Another relevant organization is the International Association for Educational Assessment (IAEA), particularly recommended for those involved in educational assessment. Although its primary focus is education, the IAEA also works on the adaptation of psychometric tests to be used in various educational systems and cultures. The organization promotes research to ensure that assessment instruments are culturally appropriate and valid. For those involved in educational assessment in an international context, it is crucial to consider the guidelines and studies promoted by the IAEA.
Finally, it is recommended to consider the products and guidelines of The Psychological Corporation, part of Pearson, one of the leading publishers in the creation and adaptation of psychological and educational tests. Pearson has a long history in the standardization and adaptation of tests for use in different cultural contexts. When working with assessment tools internationally, it is important to ensure that the instruments used meet validity and reliability standards in each country, which Pearson guarantees through its rigorous adaptation and standardization process.
Limitations
One of the main limitations of this tutorial is its scope, as it is designed as an introductory guide and does not cover all possible methodological variations or contexts in which confirmatory factor analysis (CFA) can be applied. This could limit its applicability in studies requiring advanced analyses or in very specific cultural and linguistic contexts, where methodological adaptations need to be more detailed.
Additionally, the tutorial focuses on the validation of psychological instruments through CFA, but it does not address other complementary techniques that could also be relevant for evaluating psychometric properties, such as exploratory factor analysis (EFA), structural equation modeling (SEM), or methods for handling non-normal data or small samples. This might lead researchers to overlook other options that could be more suitable for their particular studies.
Another limitation is that the tutorial does not use specific, simplified examples to illustrate the steps of the analysis. Finally, the tutorial does not delve deeply into ethical or quality-related aspects of adapting and validating questionnaires in different populations, such as ensuring appropriate sample representation or the cultural sensitivity of items. This could be a challenge for researchers working with diverse or vulnerable populations.
Future Studies
Since this tutorial aims to be a practical guide, future studies should focus on the creation of more specialized complementary materials, such as advanced tutorials that include multigroup analysis, cross-validation, and handling of complex data. It would also be useful to develop guides tailored to specific contexts, such as the validation of instruments in minority populations or in longitudinal studies.
Additionally, practical guides addressing the use of specific software for performing confirmatory factor analysis would be valuable, comparing the advantages and limitations of platforms like JASP, AMOS, Mplus, R (with packages such as lavaan), or Jamovi. This could help researchers choose the most suitable tool according to their needs.
Finally, future studies could explore the integration of emerging technologies, such as artificial intelligence, to automate parts of the psychometric validation process, optimizing time and precision in analysis. This would be particularly relevant for researchers with limited resources or those working in contexts with high demands for data analysis.
Cross-cultural validation remains a critical area of research in psychometrics. Instruments developed in a specific cultural context may not have the same relevance or meaning in other cultures, posing the need to conduct studies that compare the underlying psychometric structures of tests across different populations. This not only helps ensure the validity of instruments in diverse contexts but also enables the establishment of norms that are globally applicable. Cross-cultural studies provide valuable information on the universality or cultural specificity of psychological constructs, which in turn promotes the creation of more inclusive measures adapted to cultural diversity.
The exploration of new psychometric techniques is another area that should continue to evolve. Traditionally, classical test theory has dominated the field of psychometric evaluation, but more advanced techniques such as Item Response Theory (IRT) and Structural Equation Modeling (SEM) have proven to be powerful tools for improving the accuracy and validity of instruments. These techniques allow, for example, the identification of items that function differently in distinct subgroups and a more accurate evaluation of the latent structure of an instrument. Additionally, these techniques offer greater flexibility in data analysis, especially in situations where the assumptions of classical theory are not fully met.
Longitudinal research is another area that has been overlooked in many psychometric studies. Most studies rely on cross-sectional designs, which limit our understanding of the stability and consistency of instruments over time. Future research should focus on conducting longitudinal validations (or, at a minimum, repeated cross-sectional designs) to assess how measurements behave at different time points and under various circumstances. This not only provides data on the temporal reliability of the instrument but also allows for the examination of changes in the constructs measured as individuals or populations evolve.
The use of emerging technologies also offers new possibilities for psychometrics. The growing prevalence of digital platforms and online psychological assessment apps necessitates the adaptation and validation of instruments for use in virtual environments. Although these technologies offer opportunities for large-scale test administration, they also introduce challenges related to accessibility, data security, and comparability with traditional methods. However, the implementation of computerized adaptive tests and the use of artificial intelligence to analyze response patterns represent a significant opportunity to optimize the precision and efficiency of psychological assessment in digital settings.