Enhancing Race and Ethnicity using Bayesian Imputation in an All Payer Claims Database

Background: All Payer Claims Databases (APCDs) are a rich source of health information; however, race and ethnicity (R&E) data are largely missing. Bayesian Improved Surname Geocoding (BISG) is a common R&E imputation method, yet validation of BISG in APCDs is lacking. We used BISG to impute missing R&E in the Oregon APCD.

Methods: BISG-imputed R&E for Asian Pacific Islanders (API), Blacks, Hispanics, and Whites was compared with the gold standard (vital statistics), and improvements in sensitivity and specificity were assessed. Logistic regression examined whether missing R&E was random across patient characteristics.

Results: Among 85,857 individuals in the study, 32.1% (n=27,594) had missing R&E. Missing R&E was not randomly distributed. There were higher odds of missingness among males, Whites, those aged 65 and older, and commercially insured individuals. Differences in the percent missing were also found by comorbid conditions and causes of death. Imputing the missing R&E with the BISG method improved the sensitivity to identify White, Black, API, and Hispanic individuals.

Conclusions: APCDs can benefit from enhancing missing R&E with BISG imputation to perform more robust population-health level analyses and identify inequities according to R&E without losing power or dropping non-random records with missing R&E data.

regression was used to assess factors associated with increased odds of missing race and ethnicity in the APCD. Validation of the various race/ethnicity sources and improvements in sensitivity were compared against the gold standard vital statistics.

In the ever-changing healthcare landscape, data are required to develop new approaches to improve healthcare quality, use resources efficiently, and analyze system performance. To meet this growing need, states are increasingly mandating through legislation that commercial and public payers providing insurance plans in their state submit data to All-Payer Claims Databases (APCDs), which include medical claims, pharmacy claims, insurance enrollment data, provider information, and dental claims [1]. To date, many states have an existing APCD (mandated or voluntary), have one under implementation, or have expressed strong interest in one [2]. States can use APCDs for many purposes, including to improve health system performance, assess the impact of policy changes, understand key cost and utilization drivers, monitor population health trends, develop interventions, and conduct research [3,4].

APCDs are a rich source of clinical information with great potential to provide the data needed to comprehensively address long-standing racial and ethnic inequities in the quality of healthcare delivery at the population level [5]. This potential is constrained, however, by the lack of reliable race and ethnicity information in APCDs. Despite long-standing recommendations from the Institute of Medicine, the National Quality Forum, and others that health plans systematically collect race and ethnicity data, implementation has been slow due to privacy concerns and resource limitations [6-9]. Even Medicare and Medicaid plans, which are federally mandated to collect race and ethnicity data by the Affordable Care Act [10], continue to struggle to collect this information completely [11].

To address the ongoing challenges regarding limited race and ethnicity data in administrative datasets, Elliott and colleagues developed the Bayesian Improved Surname Geocoding (BISG) imputation method [12].
BISG estimates the probability that an individual is a member of a given racial or ethnic category based on their surname and their address. Surname analysis is conducted using the U.S. Census surname list, which provides common surnames for racial and ethnic groups based on information collected from the decennial census [13].

policies to address health disparities in quality of care and patient outcomes, the validation of reliable methods to estimate race and ethnicity in APCDs is an important public health issue.
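The way BISG combines surname and address information can be sketched as a single application of Bayes' rule. The snippet below is a minimal illustration, not the implementation used in this study: the function name `bisg_posterior`, the input dictionaries, and all numeric values are hypothetical, and it assumes (as BISG does in general) that surname and place of residence are conditionally independent given race/ethnicity.

```python
def bisg_posterior(p_race_given_surname, p_geo_given_race):
    """Combine a surname-based prior with geographic evidence via Bayes' rule.

    p_race_given_surname: P(race | surname), e.g. taken from a surname list.
    p_geo_given_race: P(living in this area | race), i.e. the share of each
        group's population that resides in the individual's area.
    Returns P(race | surname, area), normalized to sum to 1.
    """
    unnormalized = {
        race: p_race_given_surname[race] * p_geo_given_race[race]
        for race in p_race_given_surname
    }
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("no category has nonzero support")
    return {race: value / total for race, value in unnormalized.items()}


# Hypothetical inputs for one person (illustrative numbers only).
surname_prior = {"API": 0.05, "Black": 0.10, "Hispanic": 0.70, "White": 0.15}
area_likelihood = {"API": 0.001, "Black": 0.001, "Hispanic": 0.004, "White": 0.002}

posterior = bisg_posterior(surname_prior, area_likelihood)
most_likely = max(posterior, key=posterior.get)
```

In this made-up case the geographic evidence reinforces the surname prior, so the posterior probability for the most likely group ends up higher than the surname-only probability.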

The aim of this study was to validate BISG and use it to estimate the distribution of patient characteristics according to race and ethnicity in Oregon's APCD. First, differential missingness of race and ethnicity in the APCD was assessed by patient characteristics. Then, the ability of BISG to accurately impute missing race and ethnicity for Asian/Pacific Islander (API), Black, Hispanic, and White populations was examined using Oregon death certificates as the "gold standard" [19].

The initial sample contained 105,240 individuals who had a vital statistics record and were linked with the APCD. Individuals with missing race/ethnicity in vital statistics data were excluded (n=197, 0.19%).

Vital statistics race/ethnicity
A combined vitals race/ethnicity variable was created and included: American Indian/Alaska Native (AIAN), Asian Pacific Islander (API), Black, Hispanic (any race), White, and an "other" race group. AIAN individuals and individuals with more than one race were placed in the "other race" category due to the limitations of low imputation accuracy of BISG for AIAN and multi-

To estimate the optimal probability thresholds for creating binary BISG race/ethnicity variables, the Youden index optimal cutoff [25] was applied, which maximizes the difference between the true positive rate and the false positive rate on the receiver operating characteristic (ROC) curve. The Youden index produced overlapping cutoffs for Whites (0.44) and Blacks (0.48), under which some individuals would be predicted to belong to both the White and Black groups. Therefore, discriminative thresholds of at least 0.5 were used to assign patients to a single race/ethnic group. Sensitivity analyses comparing cutoffs of 0.5 and 0.75 were conducted based on previous studies [17,26]. A total of 98.4% of the study population met the starting cutoff of 0.5 for the assignment of race/ethnicity, while only 83.1% met the 0.75 threshold. In the sensitivity analyses, sensitivity and specificity decreased considerably when the 0.75 cutoff was applied compared with the 0.5 cutoff. Based on this, the 0.5 cutoff was used to assign BISG race/ethnicity. There were 141 records for which BISG imputation was undefined, i.e., none of the race categories reached the 0.5 cutoff; these were excluded. BISG probabilities were not imputed for individuals without a geocoded address (address missing completely or partially, or PO boxes). There were 19,045 records with both missing APCD self-reported and BISG-imputed race and ethnicity. These records were excluded from the initial sample, resulting in a total sample of 85,857 individuals.
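The threshold logic described above can be sketched in a few lines. This is an illustrative sketch only: `youden_cutoff` and `assign_group` are hypothetical helper names, the probabilities and membership labels are invented, and the code assumes each group's binary indicator is evaluated separately against the gold standard.

```python
def youden_cutoff(probabilities, is_member):
    """Return (cutoff, J) maximizing J = sensitivity + specificity - 1.

    J equals the true positive rate minus the false positive rate.
    Assumes at least one member and one non-member are present.
    """
    positives = sum(1 for y in is_member if y)
    negatives = len(is_member) - positives
    best_cutoff, best_j = None, -1.0
    for cutoff in sorted(set(probabilities)):
        tp = sum(1 for p, y in zip(probabilities, is_member) if y and p >= cutoff)
        fp = sum(1 for p, y in zip(probabilities, is_member) if not y and p >= cutoff)
        j = tp / positives - fp / negatives
        if j > best_j:
            best_cutoff, best_j = cutoff, j
    return best_cutoff, best_j


def assign_group(posterior, cutoff=0.5):
    """Assign the single most probable group, or None if it is below the cutoff."""
    group, probability = max(posterior.items(), key=lambda item: item[1])
    return group if probability >= cutoff else None


# Hypothetical BISG probabilities for one group and gold-standard membership.
probs = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
truth = [True, True, True, False, True, False]
cutoff, j = youden_cutoff(probs, truth)

# A record where no category reaches 0.5 stays unassigned (undefined).
undefined = assign_group({"API": 0.02, "Black": 0.46, "Hispanic": 0.04, "White": 0.48})
assigned = assign_group({"API": 0.02, "Black": 0.08, "Hispanic": 0.04, "White": 0.86})
```

Because per-group Youden cutoffs below 0.5 can be met by two groups at once, requiring the top probability to reach at least 0.5 is what guarantees a single assignment in all but exact ties.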

Enhanced APCD race/ethnicity
BISG-imputed probabilities were used to assign race/ethnicity for APCD records where it was missing.
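A minimal sketch of this "enhancement" step follows, assuming each record carries a self-reported value (or `None` when missing) and a BISG assignment already thresholded to a single group (or `None` when undefined). The function name, the tuple layout, and the sample records are hypothetical.

```python
def enhance_race_ethnicity(self_reported, bisg_assigned):
    """Keep the self-reported value when present; otherwise use the BISG assignment.

    Returns (value, source) so that imputed values remain distinguishable.
    """
    if self_reported is not None:
        return self_reported, "self-reported"
    if bisg_assigned is not None:
        return bisg_assigned, "BISG"
    return None, "missing"


# Hypothetical records: (self-reported, BISG-assigned).
records = [("Hispanic", "White"), (None, "Black"), (None, None)]
enhanced = [enhance_race_ethnicity(sr, bisg) for sr, bisg in records]
```

Returning the source alongside the value is a design choice for the sketch; it lets later analyses separate self-reported from imputed records.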

Analysis variables
To assess the differential missingness of race and ethnicity in the APCD records, an indicator of whether APCD race/ethnicity was missing was created. Patient characteristics included age at death, gender, payer, year of death, comorbidities, and the top ten causes of death. Because age at death from the vital statistics was used, the age distribution is skewed towards older ages;

Comorbidities present in 10% of the sample or more are reported.

The top ten causes of death are based on the national "leading causes of death" report [30].

Causes of death were identified using ICD-10 from the underlying causes of death including

Missing APCD race and ethnicity data were not randomly distributed across patient characteristics. The logistic regression results showed multiple factors that were significantly associated with the odds of missing APCD self-reported race and ethnicity. For instance, a gradual increase in the frequency of missing race and ethnicity with increasing age was observed. Those aged 85 years and older had almost 4-fold increased odds of missing race and ethnicity (adjusted odds ratio (aOR) 3.74, 95% confidence interval (CI) 3.51-3.98) compared to patients less than 65 years of age. Males were also more likely to be missing

Medicaid enrollees, individuals with commercial insurance had a considerably higher likelihood of missing APCD race and ethnicity (aOR 43.8,

This is the first study to our knowledge to examine the ability of the BISG algorithm to accurately impute race and ethnicity when missing in an APCD. Findings suggest that missingness of race and ethnicity variables in the APCD is common. Moreover, missingness does not occur equally; it is more likely among males, Whites, older individuals, and commercially insured individuals. There were also significant differences between missing and non-missing records in the representation of comorbid conditions and causes of death. This study found that imputing the missing race/ethnicity with the BISG method greatly improved the sensitivity to detect White, Black, API, and Hispanic groups.
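How imputation changes sensitivity and specificity against a gold standard can be illustrated with a small one-vs-rest calculation. The sketch below is hypothetical throughout: the function name, the six toy records, and the convention that a missing value counts as "not in the group" are assumptions made for illustration, not the study's data or exact procedure.

```python
def sensitivity_specificity(predicted, gold, group):
    """One-vs-rest sensitivity and specificity for a single group.

    A missing prediction (None) is counted as 'not in the group'.
    """
    pairs = list(zip(predicted, gold))
    tp = sum(1 for p, g in pairs if g == group and p == group)
    fn = sum(1 for p, g in pairs if g == group and p != group)
    tn = sum(1 for p, g in pairs if g != group and p != group)
    fp = sum(1 for p, g in pairs if g != group and p == group)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical records: gold standard, original source, and enhanced source.
gold = ["White", "White", "Black", "Hispanic", "White", "Black"]
original = ["White", None, None, "Hispanic", None, "Black"]
enhanced = ["White", "White", "White", "Hispanic", "White", "Black"]

sens_before, spec_before = sensitivity_specificity(original, gold, "White")
sens_after, spec_after = sensitivity_specificity(enhanced, gold, "White")
```

In this toy example, filling the missing values raises sensitivity for the White group from one third to 1.0, while one misassigned record lowers specificity, which is the kind of trade-off a validation against vital statistics is meant to quantify.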

Missing race/ethnicity in administrative data sets is a known issue [18,26,31]. Because it includes most of a state's population, an APCD can be valuable for answering many important health questions. However, when a high proportion of records are missing race/ethnicity, the ability of APCD-based analyses to assess health outcomes is limited. Enhancing APCD data with indirect imputation based on surname and address can allow for more robust population health studies.

This is particularly important because excluding records with missing race/ethnicity may introduce bias, especially if that missingness is not random, as observed in this study. For instance, Grundmeier and colleagues simulated not-randomly missing race/ethnicity data to compare the association of race/ethnicity with pediatric health outcomes when race/ethnicity is imputed vs. excluded. They found that imputing missing data with BISG reduces bias compared to using only records with non-missing values [32]. This study found similar results.
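The bias from dropping records with non-random missingness can be seen with simple arithmetic. The sketch below uses entirely hypothetical group labels, counts, and missingness rates (none come from this study) to show how a complete-case analysis distorts a group's share of the population.

```python
def complete_case_share(counts, missing_rates, group):
    """Share of `group` among records that still have race/ethnicity recorded."""
    observed = {g: counts[g] * (1.0 - missing_rates[g]) for g in counts}
    return observed[group] / sum(observed.values())


# Hypothetical population: group A is larger and has more missing data.
counts = {"A": 800, "B": 200}
missing_rates = {"A": 0.4, "B": 0.1}

true_share = counts["A"] / sum(counts.values())
observed_share = complete_case_share(counts, missing_rates, "A")
```

Here group A makes up 80% of the population but only about 73% of the complete cases (480 of 660), so any estimate built on complete cases alone would under-represent the group with more missing data.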

Compared to other validation studies, findings from this study show that BISG alone performed less well in estimating racial and ethnic categories correctly, particularly for Blacks (low sensitivity and specificity). For example, Adjaye-Gbewonyo et al. found that BISG captured 71.8% of Blacks, while in this study the sensitivity was lower, at only 58% [33]. This is likely driven by the racial composition of the sample studied. This study is drawn from a mostly White state

to attribute deterministic probabilities for the racial make-up of a community resulting in less accurate predictions. This study showed that using the BISG method to "enhance" the existing race/ethnicity information is more suitable than relying on BISG predictions alone, especially in less diverse and more integrated communities.

This study has a number of limitations. The use of death data means that the sample for this study skews older than the Oregon population, in which the race and ethnic distributions may be different. Using a more representative data source with respect to age and that includes a more reliable capture of AIAN race is warranted. Because of data use agreement restrictions, the voluntary APCD used in this study did not include Medicare fee-for-service patients, however,