On the Treatment of Missing Item Responses in Educational Large-scale Assessment Data: The Case of PISA 2018 Mathematics

Abstract: Missing item responses are prevalent in educational large-scale assessment studies such as the Programme for International Student Assessment (PISA). The current operational practice scores missing item responses as wrong, but several psychometricians have advocated a model-based treatment based on the latent ignorability assumption. In this approach, item responses and response indicators are jointly modeled conditional on a latent ability and a latent response propensity variable. Alternatively, imputation-based approaches can be used. The latent ignorability assumption is weakened in the Mislevy-Wu model, which characterizes a nonignorable missingness mechanism and allows the missingness of an item to depend on the item itself. The scoring of missing item responses as wrong and the latent ignorable model are submodels of the Mislevy-Wu model. This article uses the PISA 2018 mathematics dataset to investigate the consequences of different missing data treatments on country means. The obtained country means can differ substantially between the scaling models. In contrast to previous statements in the literature, the scoring of missing item responses as incorrect provided a better model fit than a latent ignorable model for most countries. Furthermore, the dependence of the missingness of an item on the item itself, after conditioning on the latent response propensity, was much more pronounced for constructed-response items than for multiple-choice items. As a consequence, scaling models that presuppose latent ignorability should be rejected from two perspectives. First, the Mislevy-Wu model is preferred over the latent ignorable model for reasons of model fit. Second, we argue that model fit should only play a minor role in choosing psychometric models in large-scale assessment studies because validity aspects are most relevant. Missing data treatments that countries (and, hence, their students) can simply manipulate result in unfair country comparisons.
Keywords: imputation; ignorability; nonignorable


Introduction
It has frequently been argued that measured student performance in educational large-scale assessment (LSA; [1,2]) studies is affected by test-taking strategies. In a recent paper published in the highly ranked journal Science, Pohl et al. [3] argued that "current reporting practices, however, confound differences in test-taking behavior (such as working speed and item nonresponse) with differences in competencies (ability). Furthermore, they do so in a different way for different examinees, threatening the fairness of country comparisons." [3]. Hence, the reported student performance (or, equivalently, student ability) would be confounded by a "true" ability and test-taking strategies. Importantly, the authors question the validity of the country comparisons currently reported in LSA studies and argue for an approach that separates test-taking behavior (i.e., item response propensity and working speed) from a purified ability measure. The core of the Pohl et al. [3] approach is how to model missing item responses. In this article, we systematically investigate the consequences of different treatments of missing item responses in the Programme for International Student Assessment (PISA) study conducted in 2018.
While methods for treating missing data are now widely used in statistical analyses in the social sciences [4,5], recommendations for treating missing item responses in item response theory (IRT; [6]) models in LSA studies have appeared only in the recent literature [7,8]. Typically, the treatment of item responses can be distinguished between calibration (the computation of item parameters) and scaling (the computation of ability distributions).
It is essential to distinguish the type of missing item responses. Missing item responses at the end of the test are referred to as not reached items while missing items within the test are denoted as omitted items [9]. Since the PISA 2015 study, not reached items are no longer scored as incorrect, and the proportion of not reached items is used as a predictor in the latent background model [10].
Some psychometricians have repeatedly argued that missing item responses should never be scored as wrong because such a treatment would produce biased item parameter estimates and unfair country rankings [3,7,8,11,12]. In contrast, model-based treatments of missing item responses that rely on latent ignorability [3,7,8,13] are advocated. Missing item responses can be ignored in this approach when response indicators and a latent response propensity are included [14,15]. Importantly, the missingness process is summarized by the latent response propensity variable. As an alternative, multiple imputation at the level of items can be employed to properly handle missing item responses [16,17]. However, the scoring of missing item responses as wrong has been defended for validity reasons [18][19][20]. Moreover, simulation studies cannot inform about the proper treatment of missing item responses [19,21].
Although proposals for using alternative scaling models for abilities in LSA studies like PISA have been made, previous work either did not report country means in the metric of interest [7], such that consequences cannot be interpreted, or constituted only a toy analysis of a few countries [3] that did not enable a generalization to operational practice. Therefore, this article compares different scaling models that rely on different treatments of missing item responses. We use the PISA 2018 mathematics dataset as a showcase. We particularly contrast the scoring of missing item responses as wrong with model-based approaches that rely on latent ignorability [3,7,8] and the more flexible Mislevy-Wu model [22,23], which contains the former two models as submodels. In the framework of the Mislevy-Wu model, it can be tested whether the scoring of missing item responses as incorrect or their treatment as latent ignorable is preferred in terms of model fit. Moreover, it is studied whether the probability of responding to an item depends on the item response itself (i.e., nonignorable missingness, [5]). In the most general model, the missingness process is assumed to be item-format specific. Finally, we investigate the variability of country means across the different models.
The rest of the article is structured as follows. In Section 2, the sample of persons and items in PISA 2018 is characterized. In Section 3, the different scaling models and the linking procedure are described. In Section 4, the results are presented. Finally, the paper closes with a discussion in Section 5.

Sample
The mathematics test in PISA 2018 [24] was used to investigate different treatments of missing item responses. We included 45 countries that received the main test. These countries did not receive the test booklets with items of lower difficulty that were administered to low-performing countries. In total, 70 items were included in our analysis. Seven of the 70 items were partial credit items with a maximum score of two. For simplicity, these partial credit items were dichotomized (i.e., scored as correct only for a score of two on the partial credit item). In total, 27 items had the multiple-choice (MC) format, and 43 items had the constructed-response (CR) format. For 18 MC items, the guessing probability was 1/4; for 4 MC items, it was 1/8; and for 5 MC items, it was 1/16.
Students from booklets 1 to 12 were selected. Those booklets had mathematics items included in two out of four item clusters. Mathematics items appeared either at the first and second or the third and fourth positions in the test.
In Table 1, descriptive statistics for the sample used in our analysis are presented. In total, 167,092 students from these 45 countries were included in the analysis. On average, M = 3713.2 students were available in each country. The average number of students per item within each country ranged between 415.8 (MLT, Malta) and 4408.3 (ESP, Spain). On average, M = 1120.3 students per item were available at the country level.

Analysis
As stated above, all polytomous items were dichotomously scored for simplicity. Let X_pi denote the dichotomous item response and R_pi the response indicator for person p and item i. The response indicator takes the value one if X_pi is observed. Consistent with the operational practice since PISA 2015, the two-parameter logistic (2PL) model [25] is used for scaling item responses [10,24]. The item response function is given as

P(X_pi = 1 | θ_p) = Ψ(a_i (θ_p − b_i)), (1)

where Ψ denotes the logistic distribution function. The item parameters a_i and b_i are item discriminations and difficulties, respectively. It holds that 1 − Ψ(x) = Ψ(−x). The latent ability θ_p follows a normal distribution. If all item parameters are estimated, the mean of the ability distribution is fixed to zero and the standard deviation is fixed to one. The one-parameter logistic (1PL, [26]) model is obtained if all item discriminations are set equal to each other. In our analysis, the scalings are carried out separately for each country c. That is, one obtains country-specific item parameters a_ic and b_ic:

P(X_pi = 1 | θ_p) = Ψ(a_ic (θ_p − b_ic)). (2)

Sampling weights were always used when applying the scaling model (2) to the PISA 2018 dataset. To enable the comparability of the ability distributions across countries, the obtained item discriminations a_ic and item difficulties b_ic are transformed onto a common metric in a subsequent linking step (see Section 3.2 for details).
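For illustration, the item response function in Equation (1) can be written as a small function; the parameter values used below are arbitrary examples, not PISA estimates:

```python
import math

def irf_2pl(theta, a, b):
    """2PL item response function: P(X = 1 | theta) = Psi(a * (theta - b)),
    where Psi is the logistic distribution function."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# A student of average ability (theta = 0) on an item of average
# difficulty (b = 0) succeeds with probability 0.5.
p_avg = irf_2pl(theta=0.0, a=1.0, b=0.0)
# A more difficult item (b = 1) lowers the success probability.
p_hard = irf_2pl(theta=0.0, a=1.0, b=1.0)
```

The symmetry 1 − Ψ(x) = Ψ(−x) mentioned above holds for this function as well.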

Different Scaling Models for Handling Missing Item Responses
In this subsection, we describe the different scaling models used for determining country means. These models differ concerning the missingness mechanism assumptions of missing item responses.

Scoring Missing Item Responses as Wrong
In a reference model, we scored all missing item responses (omitted and not reached items) as wrong (model UW). In the literature, it is frequently argued that missing item responses should never be scored as incorrect [3,7,11,27]. However, we think that the arguments against the incorrect scoring are flawed, and simulation studies cannot show the inadequacy of the UW model (see [19][20][21]).

Scoring Missing Item Responses as Partially Correct
Missing responses for MC items can be scored as partially correct. The main idea is that a student could guess the MC item if s/he does not know the answer. If an item i has K_i alternatives, a random guess of an item option would provide a correct response with probability 1/K_i. In IRT estimation, one can weight the probabilities P(X_pi = 1) with 1/K_i and P(X_pi = 0) with 1 − 1/K_i [28]. This weighting implements a scoring of a missing MC item as partially correct (model UP). The maximum likelihood estimation is replaced by a pseudo-likelihood estimation that allows non-integer item responses [28]. The estimation was conducted in the R [29] package sirt [30]. Pseudo-likelihood estimation of IRT models that allow non-integer item responses is not widely implemented in IRT software. However, the partially correct scoring can alternatively be implemented by employing a multiple imputation approach at the level of item responses. For every missing item response of item i, a correct item response is imputed with probability 1/K_i. In our analysis, we created 10 imputed datasets to reduce the simulation error associated with the imputation. We stacked the 10 multiply imputed datasets into one long dataset and applied the 2PL scaling model (see Equation (2)) to the stacked dataset (see [31][32][33]). The stacking approach does not result in biased item parameter estimates [32], but resampling procedures are required for obtaining correct standard errors [31]. In this article, we mainly focused on differences between results from different models and did not implement resampling procedures for computing standard errors.

Missing item responses for CR items are scored as incorrect in the partially correct scoring approach because unknown answers cannot simply be guessed by students in this situation.

Preprints (www.preprints.org) | NOT PEER-REVIEWED | Posted: 6 October 2021 doi:10.20944/preprints202110.0107.v1

Note (Table 1). N = number of students; I = number of items; N_item = average number of students per item; M_OECD = officially reported country mean by OECD [24]; SD_OECD = officially reported country standard deviation by OECD [24]; M_stand = standardized country mean (M = 500 and SD = 100 in the total population); %NA = proportion of item responses with missing data; %NR = proportion of item responses that are not reached; %NA_CR = proportion of constructed-response item responses with missing data; %NA_MC = proportion of multiple-choice item responses with missing data. Missing item response rates larger than 10.0% and smaller than 5.0% are printed in bold. Missing rates for not reached responses larger than 3.0% are printed in bold. See Appendix A for country labels.
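A minimal sketch of the imputation variant of the partially correct scoring (model UP), with hypothetical item data; `None` marks a missing response, and `n_options[i]` is `None` for CR items:

```python
import random

def impute_partially_correct(responses, n_options, rng):
    """Impute missing MC responses (None) as correct with probability 1/K_i;
    missing CR responses are scored as incorrect."""
    imputed = []
    for x, k in zip(responses, n_options):
        if x is not None:
            imputed.append(x)       # observed responses are kept as-is
        elif k is None:
            imputed.append(0)       # CR item: missing scored as wrong
        else:                       # MC item with k alternatives
            imputed.append(1 if rng.random() < 1.0 / k else 0)
    return imputed

rng = random.Random(42)
# One student: observed correct response, missing MC item (4 options),
# missing CR item; 10 imputed datasets are later stacked for scaling.
datasets = [impute_partially_correct([1, None, None], [4, 4, None], rng)
            for _ in range(10)]
```

Stacking the imputed datasets and fitting one 2PL model mirrors the procedure described above.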

Treating Not Reached Items as Ignorable
Since PISA 2015, not reached items are no longer scored as wrong [10]. To investigate this scaling method, we ignored not reached items in the scaling model but scored omitted items as incorrect (model UN1). We also implemented the operational practice since PISA 2015 [10] that includes the proportion of not reached item responses as a covariate in the latent background model (model UN2; [9,34]). This second model is equivalent to latent ignorability when the response indicators for not reached items follow a 1PL model.

Treating Missing Item Responses as Ignorable
In model UO1, all missing item responses are ignored in the scaling model. The student ability θ is extracted based on the observed item responses only. The method is valid if missing item responses can be regarded as ignorable [12]. In this case, the probability of omitting items only depends on the observed item responses and not on the unobserved item responses.

Treating Missing Item Responses as Latent Ignorable
A weak variant of nonignorable missing data is latent ignorability [13,[35][36][37][38][39][40][41][42][43]. Observed item responses X_pi and response indicators R_pi are jointly modeled conditional on the latent ability θ_p and the latent response propensity ξ_p. The probability of responding to an item is given by (model MO2; [7,14,[44][45][46])

P(R_pi = 1 | θ_p, ξ_p) = Ψ(ξ_p − β_i). (3)

The 2PL model holds for item responses (see Equation (1)). A joint bivariate distribution of (θ_p, ξ_p) is modeled. In this study, a bivariate normal distribution is assumed, where SD(θ_p) is fixed to one, and SD(ξ_p), as well as Cor(θ_p, ξ_p), are estimated (see [47] for more complex distributions). The model UO1 (see Section 3.1.4) that presupposes ignorability (instead of latent ignorability) can be tested as a nested model within model MO2 by setting Cor(θ_p, ξ_p) = 0. This model is referred to as model MO1.
The model for the response indicators R_pi in Equation (3) is a 1PL model. Hence, the sum score R_p• = Σ_{i=1}^{I} R_pi is a sufficient statistic for ξ_p. Instead of estimating a joint distribution of (θ_p, ξ_p), a conditional distribution θ_p | R_p• can be specified in a latent background model. In our study, the proportion of missing item responses is used as a predictor in the latent background model (model UO2, [9]).
The models MO1 and MO2 are also used for generating multiply imputed datasets. Conditional on θ_p, missing item responses are imputed according to the response probability from the 2PL model (see Equation (1)). The stacked imputed dataset (see Section 3.1.2) can be scaled with the unidimensional 2PL model. If models MO1 or MO2 were the true data-generating models, results from multiple imputation (i.e., IO1 and IO2) would coincide with the model-based treatments (i.e., MO1 and MO2). However, results can differ in the case of misspecified models [48,49].
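Under latent ignorability, this imputation step draws each missing response from the 2PL success probability given a value of θ_p. A sketch with made-up item parameters:

```python
import math
import random

def psi(x):
    """Logistic distribution function."""
    return 1.0 / (1.0 + math.exp(-x))

def impute_from_2pl(responses, a, b, theta, rng):
    """Replace missing responses (None) with draws from the 2PL model
    P(X = 1 | theta) = Psi(a_i * (theta - b_i))."""
    return [x if x is not None
            else int(rng.random() < psi(ai * (theta - bi)))
            for x, ai, bi in zip(responses, a, b)]

rng = random.Random(1)
# One student with a missing response on the second item
completed = impute_from_2pl([1, None, 0], a=[1.2, 0.8, 1.0],
                            b=[-0.5, 0.3, 0.0], theta=0.4, rng=rng)
```

Repeating the draw 10 times and stacking the completed datasets yields the IO1/IO2 scaling input described above.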

Mislevy-Wu Model for Nonignorable Item Responses
Latent ignorability characterizes only a weak nonignorable missing data process. It might be more plausible that the probability of responding to an item depends on the observed or unobserved item response itself [50][51][52][53][54]. The so-called Mislevy-Wu model [22,55] extends the model MO2 (Equation (3)) that assumes latent ignorability to

P(R_pi = 1 | θ_p, ξ_p, X_pi) = Ψ(ξ_p − β_i − δ_i X_pi). (4)

In this model, the probability of responding to an item depends on the latent response propensity ξ_p and the item response X_pi itself (see [18,19,30,[56][57][58]). Model MM1 is defined by assuming a common δ_i parameter for all items. In model MM2, two δ parameters are estimated for the item formats CR and MC. For both models, multiply imputed datasets were also created based on the conditional distributions P(X_pi | R_pi, θ_p, ξ_p). The scaling models based on stacked imputed datasets are referred to as IM1 and IM2.
The most salient property of the models MM1 and MM2 is that the model treating missing item responses as wrong (model UW) can be tested by setting δ_i = −10, which results in a response probability of approximately one for correct responses; that is, missing item responses are almost surely incorrect ([23]; see also [59]). This model is referred to as model MW, and the corresponding scaling model based on imputations as IW. Moreover, the model MO2 assuming latent ignorability is obtained by setting δ_i = 0 for all items i. It has been shown that model selection among the models MW, MO2, and MM1 can be satisfactorily conducted utilizing information criteria [23].
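The nesting of these models can be illustrated numerically. The sketch below assumes the parameterization P(R = 1 | ξ, X) = Ψ(ξ − β − δX) for the response indicators, under which δ = 0 recovers the latent ignorable model and δ = −10 forces correct responses to be (almost) always observed, so that a missing response is almost surely wrong; β and ξ values are arbitrary:

```python
import math

def psi(x):
    """Logistic distribution function."""
    return 1.0 / (1.0 + math.exp(-x))

def prob_respond(xi, beta, delta, x):
    """Response probability in the Mislevy-Wu model (assumed form):
    P(R = 1 | xi, X = x) = Psi(xi - beta - delta * x)."""
    return psi(xi - beta - delta * x)

# delta = 0: response probability does not depend on X (latent ignorability)
p_lig = prob_respond(xi=0.0, beta=0.5, delta=0.0, x=1)
# delta = -10: a correct response is almost always observed (model MW/UW)
p_mw = prob_respond(xi=0.0, beta=0.5, delta=-10.0, x=1)
```

For incorrect responses (x = 0), both settings reduce to the same 1PL indicator model, which is why MW and MO2 are genuinely nested in MM1/MM2.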

Imputation Models Based on Fully Conditional Specification
The imputation models discussed above are based on unidimensional or two-dimensional IRT models. Imposing such a dimensionality reduction might result in invalid imputations because almost all IRT models in large-scale assessment studies are misspecified [20]. Hence, two alternative imputation models for missing item responses were considered that rely on fully conditional specification (FCS; [32]), implemented in the R package mice [60].
Previous research indicated that item parameters are affected by position effects [61][62][63][64][65][66][67][68]. Hence, the FCS imputation is separately conducted for each test booklet. In the imputation model IF1, only item responses were included. Linear regression with predictive mean matching (PMM; [32]) was used as the imputation model. In each iteration and for each imputation model, the predictors (item responses except the item that is imputed) are transformed using ten factors obtained from partial least squares regression to avoid the curse of dimensionality due to estimating too many parameters in the regression models [69,70].
In model IF2, response indicators were additionally included [71]. In contrast to the Mislevy-Wu model, for imputing item response X_pi, the set of predictors X_pj, R_pj (j ≠ i) was used. Hence, the probability of responding to an item is not allowed to depend on the item itself. This assumption might be less plausible than assuming the response model in Equation (4). As in model IF1, ten factors from partial least squares regression were used for reducing the dimension of the covariate space in the conditional imputation models, and PMM was utilized.
As for all multiple imputations in our study, 10 imputed datasets were created, and the 2PL scaling model was applied to the stacked dataset involving all imputed datasets (see Section 3.1.2).
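Predictive mean matching, the building block of the FCS imputations IF1 and IF2, can be sketched as follows. The regression step producing the predicted values is omitted here (in mice it would use the PLS-reduced predictors); the data are made up:

```python
import random

def pmm_impute(y, y_hat, rng, n_donors=3):
    """Predictive mean matching: for each missing y[i] (None), select the
    n_donors observed cases whose predicted values are closest to y_hat[i]
    and copy the observed value of one randomly drawn donor."""
    donors_pool = [(y_hat[j], y[j]) for j in range(len(y)) if y[j] is not None]
    completed = list(y)
    for i, yi in enumerate(y):
        if yi is None:
            donors = sorted(donors_pool,
                            key=lambda t: abs(t[0] - y_hat[i]))[:n_donors]
            completed[i] = rng.choice(donors)[1]
    return completed

rng = random.Random(7)
# y: one item's responses across students, with one missing value;
# y_hat: predicted values from a (hypothetical) regression on other items
filled = pmm_impute([1, 0, None, 1], [0.8, 0.1, 0.75, 0.9], rng)
```

Because PMM copies observed values, the imputations are always valid item scores (here, 0 or 1), which is one reason it is a common default for FCS.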

Linking Procedure
The scaling models described above resulted in country-specific item discriminations a_ic and item difficulties b_ic. To enable a comparison of country means, the corresponding ability distributions can be obtained by linking approaches that establish a common ability metric [72,73]. In this article, Haberman linking [74] in its original proposal is used (see also [75,76]). The outcomes of the linking procedure are country means and standard deviations. To enable comparisons across the 19 different specified scaling models, the ability distributions were linearly transformed such that the total population involving all students of all countries in our study has a mean of M = 500 and a standard deviation of SD = 100.
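The final linear transformation can be sketched as follows: linked country means and SDs (on the provisional ability metric) are rescaled so that the pooled student population has M = 500 and SD = 100. The weights are the countries' student proportions; all numbers are made up:

```python
def standardize(country_means, country_sds, weights):
    """Linearly transform linked country distributions so that the pooled
    population (weights = student proportions) has mean 500 and SD 100."""
    pooled_mean = sum(w * m for m, w in zip(country_means, weights))
    # pooled variance = weighted within-country + between-country variance
    pooled_var = sum(w * (s**2 + (m - pooled_mean)**2)
                     for m, s, w in zip(country_means, country_sds, weights))
    scale = 100.0 / pooled_var**0.5
    return [500.0 + scale * (m - pooled_mean) for m in country_means]

means_500 = standardize([0.10, -0.20, 0.00], [1.0, 0.9, 1.1], [0.4, 0.3, 0.3])
```

Since the transformation is linear and increasing, it preserves the ordering of countries within each scaling model; differences between models arise solely from the treatment of missing item responses.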

Model Comparisons
It is of particular interest whether the Mislevy-Wu model (MM1 and MM2) outperforms other treatments of missing item responses such as the scoring as wrong (model MW) and the latent ignorable models (MO1 and MO2). The Bayesian information criterion (BIC) is used for conducting model comparisons ([23]; see also [24,[77][78][79] for similar model comparisons in PISA, but [80][81][82] for improved information criteria in complex surveys). Moreover, the Gilula-Haberman penalty (GHP; [83,84]) is used as an effect size that is relatively independent of the sample size and the number of items. A difference in GHP larger than 0.001 is declared a notable difference in model fit [84,85].
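Both criteria are simple functions of the maximized log-likelihood. In the sketch below, the GHP is taken as the AIC normalized by twice the number of observed item responses (a common operationalization, here an assumption on our part), which makes it roughly independent of sample size; the fit values are hypothetical:

```python
import math

def bic(log_lik, n_params, n_obs):
    """Bayesian information criterion; smaller values indicate better fit."""
    return -2.0 * log_lik + math.log(n_obs) * n_params

def ghp(log_lik, n_params, n_obs):
    """Gilula-Haberman penalty: AIC scaled by twice the number of
    observed item responses (assumed operationalization)."""
    return (-2.0 * log_lik + 2.0 * n_params) / (2.0 * n_obs)

# Hypothetical comparison of two models fitted to 100,000 item responses
delta_ghp = ghp(-61000.0, 140, 100000) - ghp(-60890.0, 142, 100000)
notable = delta_ghp > 0.001   # threshold for a notable fit difference
```

In this hypothetical comparison, the second model fits better by both criteria, and the GHP difference exceeds the 0.001 threshold.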

Similarity of Scaling Models
Each of the 19 scaling models provided a set of country means. For each country, the absolute difference of the means stemming from two models can be computed. Table 2 summarizes the average absolute differences. Scaling models that resulted in an average absolute difference of at most 1.0 can be considered similar. Table 2 indicates that the methods that treat missing item responses as wrong (UW, MW, IW) or treat MC items as partially correct (UP, IP) resulted in similar country mean estimates. Both methods that did not score not reached item responses as wrong (UN1, UN2) resulted in relatively similar estimates. The models that rely on ignorability (UO1, MO1, IO1) or latent ignorability (MO2, UO2, IO2) provided similar estimates. In line with previous research [12], the inclusion of the latent response propensity ξ did not result in strongly different estimates of country means compared to models that ignore missing item responses. The specifications of the Mislevy-Wu model (MM1, IM1, MM2, IM2) resulted in similar country means. Interestingly, country means from the Mislevy-Wu model were more similar to those from the treatment of missing item responses as incorrect than to those that relied on ignorability or latent ignorability. Finally, the scaling model based on FCS imputation involving only item responses (IF1) was similar to the models assuming (latent) ignorability (UO1, MO1, IO1, MO2, UO2, IO2). The FCS imputation involving item responses and response indicators of items different from the imputed item (IF2) was neither similar to the ignorability-based treatments nor to the incorrect scoring method or the Mislevy-Wu model. This finding could be explained by the fact that the imputation method IF2 relies on assumptions about the missingness mechanism that strongly oppose those of the Mislevy-Wu model.
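The similarity criterion used in Table 2 is simply the average absolute difference between two models' country means. A sketch with made-up values:

```python
def mean_abs_diff(means_a, means_b):
    """Average absolute difference of country means between two scaling
    models; values of at most 1.0 are taken to indicate similar models."""
    return sum(abs(a - b) for a, b in zip(means_a, means_b)) / len(means_a)

# Hypothetical country means under two scaling models
uw = [505.2, 492.0, 510.3]
mo2 = [500.7, 493.1, 509.8]
similar = mean_abs_diff(uw, mo2) <= 1.0
```

With these made-up values, the average absolute difference is about 2.0, so the two models would not be classified as similar.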

Model Comparisons
From Table 3, we can see that for the majority of countries (35 out of 45), the IRT model treating missing item responses as incorrect (model MW) provided a better model fit in terms of the BIC than modeling them with a latent response propensity (model MO2). For 39 out of 45 countries, the Mislevy-Wu model with item-format-specific δ parameters (model MM2) was preferred. In 5 out of 45 countries, the Mislevy-Wu model with one common δ parameter (MM1) was the best-fitting model. Only in one country (MYS) did the model treating missing item responses as wrong have the best model fit. For 29 out of 45 countries, the Mislevy-Wu model outperformed the model with a latent response propensity in terms of a GHP difference of at least 0.001. Overall, these findings indicate that the models assuming ignorability or latent ignorability performed worse in terms of model fit than scaling models that acknowledge the dependence of responding to an item on the true but occasionally unobserved item response.

Country-Specific Model Parameters for Latent Ignorable and Mislevy-Wu Model
Now, we present findings on the model parameters characterizing the missingness mechanism from the model MO2 relying on latent ignorability and the Mislevy-Wu model MM2. The parameters are shown in Table 4. The SD of the latent response propensity, SD(ξ), differed slightly between the Mislevy-Wu model (MM2; median Med = 1.98) and the model assuming latent ignorability (MO2; Med = 1.93). Moreover, by additionally including the latent item response as a predictor of the response indicator, the correlation Cor(θ, ξ) between the latent ability θ and the response propensity ξ was slightly lower in model MM2 (Med = .43) than in MO2 (Med = .46). Most importantly, the missingness mechanism strongly differed between CR and MC items. The median δ parameter in model MM2 for CR items was −2.61, indicating that students who did not know the answer to an item had a higher probability of omitting it, even after controlling for the latent response propensity ξ. In contrast, the median δ parameter for MC items was −0.48. Hence, there was a smaller influence of (latently) knowing the item on the response indicators. However, it differed from zero for most countries, indicating that the model MO2 assuming latent ignorability did not adequately capture the missingness mechanism. Overall, it can be seen that the model parameters vary strongly across countries. Hence, it can be concluded that assuming different missingness mechanisms across countries could have non-negligible consequences for country rankings (see [86]).

Country Means Obtained From Different Scaling Models
For comparing country means, 11 out of the 19 specified scaling models were selected to contrast the dissimilarity of country mean estimates. Table 5 shows the country means for these 11 different treatments of missing item responses. The country rank (column "rk UW") serves as the reference for the comparison among methods. Moreover, the interval of country ranks obtained from the different methods is shown in column "rk Int". The average maximum difference in country ranks was 2.4 (SD = 1.8) and ranged between 0 (SGP, HKG, EST, DEU, LUX, BIH) and 8 (IRL). The range in country means (i.e., the difference between the largest and smallest country mean across the 11 methods) was noticeable (M = 5.0) and showed strong variability between countries (SD = 2.8, Min = 1.5, Max = 12.5). Interestingly, large range values were obtained for countries with missing proportions strongly below or above the average missing proportion. For example, Ireland (IRL) had a relatively low missing rate of 5.8% and reached rank 15 with method UW (M = 505.2) that treated missing item responses as incorrect. Methods that ignore missing item responses resulted in a lower country mean (UO1: M = 499.9; MO2: M = 500.7; IO2: M = 500.0). In contrast, the Mislevy-Wu model (MM2 and IM2), which also takes the relation of the response indicator and the true item response into account, resulted in higher country means (MM2: M = 505.1; IM2: M = 504.9). Across the 11 estimation methods, Ireland reached ranks between 15 and 23, which can be considered a large variability. Moreover, the range of country means for Ireland was 8.2, which is two to three times larger than the standard errors for country means due to the sampling of students in PISA. Italy (ITA, rank 26; M = 492.0) had a relatively high missing rate of 12.
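The rank interval and range statistics reported in Table 5 can be computed per country as in the following sketch (made-up means for three countries under three methods):

```python
def rank_interval_and_range(means_by_method, idx=0):
    """Summarize, for the country at position idx, the interval of ranks
    and the range of its means across scaling methods (rank 1 = highest)."""
    ranks, own_means = [], []
    for means in means_by_method.values():
        own = means[idx]
        own_means.append(own)
        ranks.append(1 + sum(m > own for m in means))
    return (min(ranks), max(ranks)), max(own_means) - min(own_means)

# Hypothetical country means (country of interest at index 0)
methods = {"UW":  [505.2, 500.0, 492.0],
           "MO2": [500.7, 501.5, 493.1],
           "MM2": [505.1, 499.8, 492.5]}
(rk_lo, rk_hi), rg = rank_interval_and_range(methods)
```

In this toy example the country's rank shifts between 1 and 2 depending on the missing data treatment, which is the kind of method-induced variability summarized in Table 5.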

Discussion
In this reanalysis of the PISA 2018 mathematics data, different scaling models with different treatments of missing item responses were specified. It has been shown that differences in country means across models can be substantial. The present study sheds some light on the ongoing debate about how to properly handle missing item responses in educational large-scale assessment studies. Ignoring missing item responses and treating them as wrong can be seen as opposing strategies. Other scaling models can be interpreted as providing results somewhere between these two extreme poles of handling missingness. We argued that the Mislevy-Wu model contains the strategy of incorrect scoring and the latent ignorable model as submodels. Hence, these missing data treatments can be tested against each other. In our analysis, it turned out that the Mislevy-Wu model fitted the PISA data best. More importantly, the treatment of missing item responses as incorrect provided a better model fit than ignoring them or modeling them by the latent ignorable model that has been strongly advocated in the past [7,8]. It also turned out that the missingness mechanism strongly differed between CR and MC items.
We believe that the call for controlling for test-taking behavior in the reporting in large-scale assessment studies such as response propensity [3] using models that also include response times [87,88] poses a threat to validity because results can be simply manipulated by instructing students to omit items they do not know [20]. Notably, missing item responses are mostly omissions for CR items. Response times might be useful for detecting rapid guessing [56,[89][90][91][92]. However, it seems likely that students who do not know the solution to CR items do not respond to these items. In this case, the latent ignorability assumption is unlikely to hold, and scaling models that rely on it (see [3,9]) will result in conflated and unfair country comparisons.
In this article, we only investigated the impact of missing item responses on country means. In LSA studies, missing data is also a prevalent issue for student covariates (e.g., sociodemographic status; see [69,[93][94][95][96][97]). As covariates also enter the plausible value imputation of latent abilities through the latent background model [34,98] or relationships of abilities and covariates are often of interest in reporting, missing data on covariates is also a crucial issue that needs to be adequately addressed [69].
It could be argued that there is not a unique scientifically sound or publicly accepted scaling model in PISA. The uncertainty in choosing a psychometric model can be reflected by explicitly acknowledging the variability of country means obtained by different model assumptions. This additional source of variance associated with model uncertainty [99][100][101][102][103][104] can be added to the standard error due to the sampling of students and linking error due to the selection of items [105]. The assessment of specification uncertainty has been discussed in sensitivity analysis [106] and has recently become popular as multiverse analysis [107,108] or specification curve analysis [109,110]. As educational LSA studies are policy-relevant, we think that model uncertainty should be included in statistical inference.
Informed Consent Statement: Informed consent was obtained from all subjects involved in the study.

Note (Table 5). %NA = proportion of item responses with missing data; %NR = proportion of item responses that are not reached; rk UW = country rank from model UW; rk Int = interval of country ranks obtained from the 11 different scaling models; Aver = average of country means across the 11 models; SD = standard deviation of country means across the 11 models; rg = range of country means across the 11 models; UW = scoring as wrong (Sect. 3.1.1). Printed in bold: missing proportions larger than 10.0% and smaller than 5.0%, not reached proportions larger than 3.0%, country rank differences larger than 2, and ranges in country means larger than 5.0. See Appendix A for country labels.