Development and Validation of a Primary Care Electronic Health Record Phenotype to Study Migration and Health in the UK

International migrants comprised 14% of the UK’s population in 2020; however, their health is rarely studied at a population level using primary care electronic health records due to difficulties in their identification. We developed a migration phenotype using country of birth, visa status, non-English main/first language and non-UK-origin codes and applied it to the Clinical Practice Research Datalink (CPRD) GOLD database of 16,071,111 primary care patients between 1997 and 2018. We compared the completeness and representativeness of the identified migrant population to Office for National Statistics (ONS) country-of-birth and 2011 census data by year, age, sex, geographic region of birth and ethnicity. Between 1997 and 2018, 403,768 migrants (2.51% of the CPRD GOLD population) were identified: 178,749 (1.11%) had foreign-country-of-birth or visa-status codes, 216,731 (1.35%) non-English-main/first-language codes, and 8288 (0.05%) non-UK-origin codes. The cohort was similarly distributed versus ONS data by sex and region of birth. Migration recording improved over time and younger migrants were better represented than those aged ≥50. The validated phenotype identified a large migrant cohort for use in migration health research in CPRD GOLD to inform healthcare policy and practice. The under-recording of migration status in earlier years and older ages necessitates cautious interpretation of future studies in these groups.


Background
In 2020, international migrants comprised 14% of the United Kingdom (UK) population [1]. Conditions prior to, during and after migration expose individuals to a range of health risks, resulting in differences in health outcomes between migrants and non-migrants in the migrant's country of arrival [2]. In the UK, there are well-established multi-generational minority ethnic communities but a history of 'hostile' migration policies [3]. The study of migrant health is therefore needed to complement the study of ethnic inequalities to understand how migration intersects with ethnicity, as well as its effects over and above ethnicity, in shaping risk factors for health, physical and mental health outcomes, and healthcare access [4].
While migrants' hospitalisation and mortality outcomes have been studied on a population level using electronic health records (EHRs) [5,6], primary care outcomes are scarcely investigated at this scale, despite primary care often being the first point of contact with the UK health system and a central part of the National Health Service (NHS) strategy for preventive care [7]. Most studies examining primary care outcomes in UK migrants are qualitative or employ quantitative survey methods. When EHRs have been used, primary care registration data could only be linked to disease-specific migrant health datasets such as tuberculosis screening [8]. Linkage of census data has only been attempted in Northern Ireland for prescription outcomes [9]. Additionally, three studies, all conducted by Jain et al., identified migration status without the use of data linkages [10][11][12] within the Clinical Practice Research Datalink (CPRD), one of the largest UK primary care EHR resources. Using predominantly country of birth and language codes, they estimated that 1.3% of individuals aged ≥65 years in CPRD could be identified as international migrants [11]. However, with 67.7% of migrants in England aged between 16 and 64 years old at the time of the 2011 census [13], a large proportion of migrants at younger ages were not identified by these studies.
Thus, a valid migration phenotype, which is a transparent reproducible algorithm using clinical terminology codes [14], is needed to determine migration status for individuals of all ages using UK primary care EHRs in order to study a broad range of migration health outcomes. A migration phenotype should determine the migration status of a large number of individuals who use primary care and are representative of the UK migrant population. CPRD with its associated linked datasets is an ideal database to use in the development of this phenotype so that it can be used to study primary, secondary and tertiary healthcare utilisation, mortality and other health outcomes in migrants from European Union (EU) and non-EU countries.
We aimed to develop a migration phenotype for UK NHS primary care EHRs and assess its validity in individuals of all ages by describing completeness of recording of migration status, as well as representativeness compared to the Office for National Statistics (ONS) country of birth and 2011 census statistics.

Study Design
This is a study validating a migration phenotype for a population-based cohort study of migration health in the UK using linked EHRs. A flow diagram describing this validation study is shown in Figure 1. The protocol for this population-based cohort study was published previously [15]. Briefly, the protocol describes a study in which the validity of a migration phenotype will be assessed, and a main study to be completed if the phenotype is found to be valid. The main study involves applying the phenotype to the linked CPRD dataset to describe primary care and hospital-based healthcare utilisation and mortality in migrants compared with non-migrants. The protocol also describes how patient and public involvement provided guidance on the research priorities, which included preventable causes of inpatient admission, sexual and reproductive health conditions/interventions and mental health conditions.

Ethics and Approvals
This study is based in part on data from the Clinical Practice Research Datalink obtained under license from the UK Medicines and Healthcare products Regulatory Agency (MHRA). It was approved by the MHRA Independent Scientific Advisory Committee (ISAC protocol 19_062R) and carried out as part of the CALIBER programme [16]. The data were provided by patients and collected by the NHS as part of their care and support. The interpretation and conclusions contained in this study are those of the authors alone.

Data Resource
We extracted data from the CPRD GOLD January 2019 build, which comprises approximately 16 million individuals from 761 practices covering 3.53% of the UK population [17]. CPRD GOLD contains de-identified data of patients across a network of general practices (GPs) across the UK that use Vision® EHR software. This data source is broadly representative of the age, sex and ethnicity demographics of the UK general population [18].

Inclusion Criteria
We included individuals of all ages in CPRD GOLD between 1 January 1997 and 31 December 2018 whose record was of 'acceptable' research quality. This means CPRD has verified that individuals and their GP practice were contributing 'up-to-standard' data [18]. An individual was included at the latest of 1 January 1997, their current registration date or the date on which their GP practice started contributing up-to-standard data to CPRD GOLD. An individual was excluded at the earliest of 31 December 2018, the date their care was transferred out of a CPRD GOLD practice, the practice's last data collection date for CPRD, or the individual's date of death.
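The entry and exit rules above can be illustrated with a minimal sketch. It assumes each patient record carries the dates named in the text; the function name, argument names and example dates are hypothetical, and the study's own analysis was carried out in R rather than Python.

```python
from datetime import date
from typing import Optional, Tuple

STUDY_START = date(1997, 1, 1)
STUDY_END = date(2018, 12, 31)


def follow_up_window(
    registration: date,
    practice_up_to_standard: date,
    practice_last_collection: date,
    transfer_out: Optional[date] = None,
    death: Optional[date] = None,
) -> Optional[Tuple[date, date]]:
    """Return (entry, exit) dates, or None if the patient never contributes time."""
    # Entry: the latest of study start, current registration date and the
    # date the practice began contributing up-to-standard data.
    entry = max(STUDY_START, registration, practice_up_to_standard)

    # Exit: the earliest of study end, transfer out, last collection and death.
    exit_candidates = [STUDY_END, practice_last_collection]
    exit_candidates += [d for d in (transfer_out, death) if d is not None]
    exit_date = min(exit_candidates)

    # A patient whose exit precedes entry has no follow-up in the study period.
    if entry > exit_date:
        return None
    return entry, exit_date


# Hypothetical patient: registered in 1995, practice up to standard in 2000,
# died in 2015.
window = follow_up_window(
    registration=date(1995, 6, 1),
    practice_up_to_standard=date(2000, 3, 15),
    practice_last_collection=date(2019, 1, 10),
    death=date(2015, 5, 2),
)
print(window)
```

Returning None for patients whose exit date precedes their entry date is a design choice of this sketch, not a rule stated in the text.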

Development of the Migration Phenotype
We created the phenotype using a systematic approach previously developed from the CALIBER platform described elsewhere [19]. The phenotype was created in three stages (exploration, development and implementation) with feedback at each stage from a team of clinicians, computer scientists, epidemiologists, public health practitioners, bioinformatics and migration health experts.
We searched for Read V2 terms relating to international migration using the following: *migrant*, *migrat*, *countr*, *asylum*, *refugee*, *visa*, *abroad*, *born in*, *origin*, *illegal*, *language*, with the asterisk representing a wildcard search operator. The initial list of terms was reviewed and refined by two experts in migration health research. Each term was assigned a category (Figure 1) based on the type of term ("visa status indicating migration to the UK", "main/first language not English", "country of birth outside of the UK", "non-UK origin") and a category based on the certainty of migration status ("definite", "probable", "possible"). Each individual was classified once using their highest certainty of migration category.
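The search and classification steps can be sketched as follows. This is an illustration only: the example term descriptions are hypothetical and are not taken from the published code list (Table S2), and the mapping from term category to certainty follows the description given in the Discussion (visa status and country of birth as "definite", language as "probable", non-UK origin as "possible").

```python
from fnmatch import fnmatchcase
from typing import Iterable, Optional

# Wildcard patterns listed in the text.
SEARCH_PATTERNS = [
    "*migrant*", "*migrat*", "*countr*", "*asylum*", "*refugee*", "*visa*",
    "*abroad*", "*born in*", "*origin*", "*illegal*", "*language*",
]

# Term category -> certainty of migration status.
CERTAINTY_BY_CATEGORY = {
    "visa status indicating migration to the UK": "definite",
    "country of birth outside of the UK": "definite",
    "main/first language not English": "probable",
    "non-UK origin": "possible",
}
CERTAINTY_RANK = {"possible": 1, "probable": 2, "definite": 3}


def matches_search(term: str) -> bool:
    """True if a term description matches any wildcard pattern (case-insensitive)."""
    lowered = term.lower()
    return any(fnmatchcase(lowered, pattern) for pattern in SEARCH_PATTERNS)


def classify_individual(categories: Iterable[str]) -> Optional[str]:
    """Return the highest certainty among a patient's coded term categories."""
    certainties = [CERTAINTY_BY_CATEGORY[c] for c in categories]
    if not certainties:
        return None  # no migration-related code recorded
    return max(certainties, key=CERTAINTY_RANK.__getitem__)


# Hypothetical term descriptions, used only to exercise the search.
candidates = [t for t in ["Born in Poland", "Asthma", "Refugee"] if matches_search(t)]
print(candidates)

# A patient with both a language code and a country-of-birth code.
print(classify_individual([
    "main/first language not English",
    "country of birth outside of the UK",
]))
```

The wildcard search only produces candidate terms; as the text notes, the list was then reviewed and refined by experts, a step that code of this kind does not replace.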

Outcomes
The following three outcomes were used:
1. Migration phenotype: The total number of terms used in the migration phenotype.
2. Completeness: The percentage of migrants recorded in CPRD for the whole study period, in each year and at the time of the 2011 census.
3. Representativeness: The percentage of migrants in CPRD compared with annual ONS country of birth statistics [1], and the percentage of migrants in CPRD living in England and Wales on the date of the 2011 census (27 March 2011) compared with census data [20].

Bias
Misclassification may lead to differential bias where migration status is more likely to be recorded for individuals experiencing a specific outcome than those who do not. This could lead to a false association between migration and outcomes studied. We assessed bias by comparing the distribution of migrants recorded in CPRD GOLD to ONS population statistics and created categories of migration status to address differences in the level of certainty of classification across codes included in the phenotype.

Tools
Data were supplied by the CALIBER research team in multiple files and imported into R software for cleaning and analysis. All data cleaning and analysis code has been made available as open-source metadata.

Data Analysis
We counted the number of different terms used in the migration phenotype and in each category of terms described in Table 1 (including calculation of percentages where relevant). We compared the list of terms to the Jain et al. study [11]. To assess completeness, we estimated the distribution by producing counts and percentages of migrants across the study period and at the time of the 2011 census by sex, year of birth, World Health Organization (WHO) region of birth, continent of birth, CPRD practice region (13 regions, classified by CPRD as 10 regions in England, with Scotland, Wales and Northern Ireland as separate regions) and ethnicity (18 category groupings, then further aggregated into the 6 higher-level groups of White British, White Non-British, Mixed, Asian/Asian British, Black/Black British and Other to address small group sizes; Table S1).
To assess representativeness, we compared the percentage of migrants in CPRD with annual ONS country of birth statistics [1,20] both visually/graphically and using the chi-squared test for proportions. Ratios were calculated of the proportion of migrants in CPRD compared to ONS country of birth statistics in each year between 2004 and 2018 (from 2004 onwards, ONS data are sectioned into periods January-December for a more consistent comparison across years) [11]. We also compared, visually/graphically and using the chi-squared test, the percentage of migrants in CPRD living in England on the date of the 2011 census with 2011 census data on country of birth [13] stratified by sex, age, geographical region of origin, and ethnicity. Ratios were calculated of the proportion of migrants in CPRD compared to ONS census data.
We conducted subgroup analyses based on the certainty of migration status (i.e., definite, probable, possible).
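As an illustration of a chi-squared test for two proportions, the sketch below computes the Pearson statistic for a 2x2 table using only the Python standard library. The counts in the example are hypothetical and are not study data; the function name is an assumption, and the exact test configuration used in the study's R code (for example, whether a continuity correction was applied) is not stated in the text.

```python
import math
from typing import Tuple


def chi_squared_2x2(a_yes: int, a_total: int, b_yes: int, b_total: int) -> Tuple[float, float]:
    """Pearson chi-squared test (1 df, no continuity correction) for two proportions.

    Assumes both groups are non-empty and that the 'yes' and 'no' outcomes
    each occur at least once overall, so no expected count is zero.
    """
    a_no, b_no = a_total - a_yes, b_total - b_yes
    grand = a_total + b_total
    yes_total, no_total = a_yes + b_yes, a_no + b_no

    observed = [a_yes, a_no, b_yes, b_no]
    expected = [
        a_total * yes_total / grand,
        a_total * no_total / grand,
        b_total * yes_total / grand,
        b_total * no_total / grand,
    ]
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

    # For 1 degree of freedom, P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Hypothetical counts: 10 of 100 recorded as migrants in one source,
# 20 of 100 in the other.
statistic, p_value = chi_squared_2x2(10, 100, 20, 100)
print(f"chi-squared = {statistic:.3f}, p = {p_value:.4f}")
```

With samples as large as those in CPRD GOLD, even small differences in proportions produce very small p-values, which is one reason the text pairs the test with visual comparison and ratios.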

Migrant Phenotype
Four hundred and thirty-four terms indicating migration to the UK were identified from the Read Version 2 terminology system and are listed in Table S2. The majority of terms indicated country of birth outside of the UK (51.84%; 225 out of 434 terms) or having a non-English main or first language (42.16%; 183 out of 434 terms). The remaining terms related to visa status indicating migration to the UK (3.46%; 15 out of 434 terms) or to a non-UK origin (2.53%; 11 out of 434 terms).
Sixty-seven Read codes included by the previously mentioned studies of migration health in CPRD by Jain et al. were excluded as they were largely related to reading other languages [10][11][12]. The expert group discussed that preferred written language may not always correspond to a person's main/first language, so these terms were excluded from the present migration phenotype. A further 36 language-, country-of-birth- and origin-related terms were included in the present migration phenotype that were not included by Jain et al.

Completeness
Of the patients in CPRD between January 1997 and December 2018, 2.51% (403,768/16,071,111) had at least one term indicating migration to the UK (Figure 2). In total, 467,189 events indicating migration were coded across 403,768 individuals. Of these individuals, 44.3% were classified as "definite" migrants, 53.7% as "probable" migrants, and 2.05% as "possible" migrants. The most commonly coded migration-related events indicated a non-English first/main language (56.8%). The least commonly coded events related to being of non-UK origin (2.73%). The percentage of migrants in CPRD GOLD increased from 0.20% in 1997 to 3.64% in 2018 (Table S3).
Table 1 summarises the distribution of migrants in CPRD GOLD for the demographic factors of sex, year/decade of birth, ethnicity, region of birth, and primary care practice region. Just over half of migrants were female (53.7%) and the median year of birth was 1982 (IQR 1973-1990). The most common ethnicity amongst all migrants was White Non-British (34.3%) followed by Asian/Asian British (26.7%) and Black/Black British (9.2%). In total, 42.4% of migrants in CPRD GOLD were registered with a London practice, and the proportion of patients in a region that were recorded as migrants was also highest in London (7.44%; Table S5).
Of the 140,423 patients with country of birth codes that aligned with a WHO region of birth, the most common was European Region (12.5%) followed by African Region (5.86%) and Western Pacific Region (4.36%). Of the 140,641 patients with country of birth codes that aligned with ONS Nomis continent of birth codes, the most common was the Middle East and Asia (12.5%) followed by Europe (12.4%) and Africa (5.86%).
Distribution of sex and year of birth was consistent across certainty of migration status categories (Table S4). However, ethnicity was better recorded in "probable" migrants, with only 7.71% of unknown ethnicity compared to 28.8% of "definite" migrants with unknown ethnicity.
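The headline completeness figures can be reproduced from the counts reported in the abstract. The sketch below assumes the three code groups given there map onto the certainty categories as described in the Discussion (country of birth or visa status as "definite", non-English main/first language as "probable", non-UK origin as "possible"); the variable and function names are illustrative.

```python
# Counts reported in the abstract.
TOTAL_PATIENTS = 16_071_111
MIGRANTS_BY_CERTAINTY = {
    "definite": 178_749,  # foreign country of birth or visa status codes
    "probable": 216_731,  # non-English main/first language codes
    "possible": 8_288,    # non-UK origin codes
}


def percentage(numerator: int, denominator: int) -> float:
    """Express numerator as a percentage of denominator."""
    return 100 * numerator / denominator


total_migrants = sum(MIGRANTS_BY_CERTAINTY.values())
overall_pct = percentage(total_migrants, TOTAL_PATIENTS)

# Share of each certainty category within the migrant cohort.
share_by_certainty = {
    certainty: percentage(count, total_migrants)
    for certainty, count in MIGRANTS_BY_CERTAINTY.items()
}

print(f"{total_migrants:,} migrants = {overall_pct:.2f}% of CPRD GOLD patients")
for certainty, share in share_by_certainty.items():
    print(f"{certainty}: {share:.2f}% of migrants")
```

The three group counts sum to the 403,768 migrants reported, and the resulting shares agree with the 44.3%, 53.7% and 2.05% given in the text.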

Representativeness
The percentage of patients recorded as migrants increased over time in CPRD GOLD by 4.6 times between 2004 (0.79%) and 2018 (3.64%) compared to the 1.6-fold increase in migrants as per ONS data over the same period (8.89% in 2004 to 14.2% in 2018; Figure 3). "Probable" migrants increased faster than the other two certainty categories; the "possible" certainty category remained poorly recorded throughout.
While the percentage of migrants in CPRD GOLD was consistently lower than in ONS country of birth data (p < 0.0001), the ratio of the percentage of migrants recorded in CPRD compared to ONS increased over time from 0.09 in 2004 to 0.26 in 2018 (Table S6). Migrants were under-recorded in CPRD compared to ONS 2011 census data in all age bands (Table S7), with the highest percentages recorded in the age band 25-34 years (5.22% in CPRD and 25.2% in ONS) and the lowest in the age band 85 years and older (0.64% in CPRD, 7.83% in ONS). Relative to ONS, migrants aged 0-15 years were the best recorded in CPRD (2.1% in CPRD, 5.8% in ONS, ratio = 0.41), while those aged 85 years and older were the most poorly recorded group (0.64% in CPRD, 7.83% in ONS, ratio = 0.08).
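The fold increases and CPRD-to-ONS ratios quoted above follow directly from the reported percentages, as this short sketch shows. It uses the rounded percentages given in the text, so results for other years or age bands computed this way may differ slightly from the supplementary tables, which are presumably based on unrounded values.

```python
# Percentage of the population recorded as migrants, as reported in the text.
CPRD_PCT = {2004: 0.79, 2018: 3.64}
ONS_PCT = {2004: 8.89, 2018: 14.2}


def fold_increase(series: dict, start: int, end: int) -> float:
    """Ratio of the end-year percentage to the start-year percentage."""
    return series[end] / series[start]


def recording_ratio(year: int) -> float:
    """Percentage of migrants in CPRD relative to ONS for the same year."""
    return CPRD_PCT[year] / ONS_PCT[year]


cprd_fold = fold_increase(CPRD_PCT, 2004, 2018)
ons_fold = fold_increase(ONS_PCT, 2004, 2018)
print(f"CPRD fold increase 2004-2018: {cprd_fold:.1f}")
print(f"ONS fold increase 2004-2018: {ons_fold:.1f}")
for year in (2004, 2018):
    print(f"CPRD/ONS ratio in {year}: {recording_ratio(year):.2f}")
```

A ratio of 1 would indicate that CPRD GOLD records the same percentage of migrants as ONS estimates for the population.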
Comparing the whole migrant cohort within CPRD GOLD and ONS 2011 census data (Figure 4 and Table S8), differences are smallest across age bands between 16 and 49 years old, but greatest for the 0-15-year-old band and age bands above 50 years old. The proportion of females is similarly higher than males in both datasets (52.3% in CPRD and 51.6% in ONS).
Among the CPRD migrant cohort with known ethnicity, Asian/Asian British ethnicity was more frequently recorded than amongst non-UK born individuals in ONS census data (40.5% in CPRD and 32.6% in ONS) while the White British ethnic group was recorded less frequently (1.70% in CPRD and 12.6% in ONS). White British migrants in the ONS data are likely to reflect those born to British nationals living abroad, or those who identify as White British post-arrival to the UK [21]. The remaining ethnic groups had approximately similar proportions between datasets (Figure 6). A comparison of ethnicity using the more granular 18 group classification (Table S9) resulted in small numbers, limiting the ability to draw definitive conclusions.

Discussion
We developed and evaluated a phenotyping algorithm that identified over 400,000 migrants in CPRD GOLD. The vast majority of these were either "definite" migrants (codes indicating visa status or a country of birth outside the UK) or "probable" migrants (codes indicating a first or main language that was not English). Migration status was under-recorded in CPRD GOLD compared to ONS data, particularly in individuals over the age of 50 years, but recording increased over the years to capture a quarter of the expected proportion of migrants by 2018. The distribution of sex and geographic region of birth was similar between migrants in CPRD GOLD and ONS datasets. Ethnicity was well-recorded in migrants in CPRD; however, the Asian/Asian British ethnic group was overrepresented compared to ONS data.
Several explanations may account for the lower number of migrants identified in CPRD compared with ONS data. Firstly, GPs do not routinely record migration-related information in EHRs. Recording may be limited to situations where, for example, an interpreter is needed, or differential health risks in a recent migrant's country of birth/origin will affect clinical decision making. Secondly, barriers to primary care experienced by migrants, such as language, discrimination, lack of knowledge about services [22], and fear of data sharing for the purposes of immigration enforcement [3], could affect migrants' ability or willingness to register with an NHS GP practice. This corroborates findings of lower levels of primary-care registration amongst newly-arrived migrants to the UK [8] and undocumented migrants and asylum seekers making up a large proportion of patients attending non-NHS primary care [3]. The under-recording of migrants could thus represent a lower number of migrants registering with primary care services. Thirdly, barriers to primary care access could also result in lower attendance at consultations, thereby limiting the opportunity for a GP to ask questions on country of birth, language, or visa type. If there are more opportunities to code migration status with increasing time (and more appointments attended) since GP registration, migrants represented in CPRD GOLD may be those who have lived in the UK longer. As such, generalisability of the phenotype only extends to migrants who have registered with primary care, and they are less likely to be newly arrived migrants [8].
The improved recording of migration status over time, in younger age groups, and in certain ethnic groups could also be explained by healthcare provider coding behaviours or patient healthcare utilisation patterns. Improvements in coding migrant status over time could reflect the incentivising of GPs to record main/first language terms as part of the Quality and Outcomes Framework between 2008 and 2011 [23]. These codes made up the majority of the migration phenotype, and the rate of increase in recording over time was faster in "probable" migrants (terms related to a non-English main/first language) than "definite" migrants. The better recording of migration in younger age groups may be explained by children having more routine contact with primary care unrelated to disease or illness, such as for childhood immunisations and developmental checks. Healthcare use at older ages related to chronic disease may not be as readily accessed by migrants. Older migrants may have migrated to the UK before EHRs existed or before clinical coding in EHRs was well established, and their migration status may not have been coded retrospectively. As a smaller proportion of older migrants are recorded as migrants in CPRD GOLD, there may be greater bias when studying health outcomes associated with older age groups. The better representation of migrants in the Asian/Asian British ethnic group could reflect a higher rate of consultations in this ethnic group as previously described in CPRD GOLD [24]. However, GPs could also deem migration to be more relevant to patients from an Asian/Asian British ethnic group, for example, due to assumptions made about language proficiency or specific health risks. Interpretation of findings should take this into account when analysing migration and ethnicity data using this phenotype.
Potential sources of bias also affect this study, with the main limitation being misclassification of migration status. Migrants make up considerably less of the general population than non-migrants, and as a result, the percentage of migrants misclassified as non-migrants is likely to be low. This means that estimates of outcomes in the non-migrant group would be minimally influenced by misclassification, whereas estimates of the same outcomes in the migrant group may be influenced to a greater extent. This may occur in particular as a result of the inclusion of language terms in the phenotype. It may also result in selection bias in future studies of outcomes where migration status is more likely to be recorded for individuals experiencing the outcome versus those who do not experience it. Furthermore, the representativeness of CPRD GOLD practices serving migrants compared to all UK GP practices is unknown and may have contributed to the low percentage of migrants in CPRD in regions such as London (7%) where ONS estimates of Londoners born abroad are much higher (35%) [1]. Migrants are also likely to be more mobile than non-migrants within the UK; as CPRD cannot link an individual's record from multiple CPRD practices, migrants may be more likely than non-migrants to be incorrectly counted as more than one individual within the dataset. Significant variation exists between GP practices in their recording of patient sociodemographic indicators, and a more resource-intensive source of validation, such as a nationwide survey of GP practices, is needed to examine these issues further.
Other limitations of the phenotype include, firstly, the under-identification of older migrants aged 50 years and over. Whilst we identified 0.99% of individuals aged 65 years and over as migrants on the date of the 2011 English census (Table S7), Jain et al. identified 1.3% on 1 January 2013 [12]. The greater percentage of migrants identified by Jain et al. in this age group could be a result of improved recording of migrant status over time, as discussed previously, and also the inclusion of written language codes. The addition of these written language codes could be explored in the further development of phenotype certainty categories. Secondly, language codes also make up the "probable" category of migrants, likely over-identifying migrants from non-English speaking countries and under-identifying migrants from English-speaking countries, subsequently underrepresenting economic migrants who have good English proficiency. Thirdly, aggregation of ethnic groups into six higher-level categories to deal with small group sizes in migrants loses granularity when comparing the CPRD migrant population with ONS statistics by ethnicity to assess representativeness.
The involvement of experts in migration health and CPRD to develop the migrant phenotype was a strength of this study. Compared to previous approaches, we included a further 36 relevant diagnosis terms indicating migration to create a more comprehensive phenotype. We categorised terms according to the certainty of migration status, allowing future studies to study migration health with varying degrees of certainty for how accurately the phenotype identifies migrant patients in CPRD GOLD. The specificity of the phenotype can be improved by omitting the "possible" migrant certainty category (defined by non-UK origin, making up only 2.1% of all migrants). As the proportion of migrants recorded in CPRD GOLD has improved over time, studying healthcare outcomes in more recent years may be of more value. The cohort in later years should be compared to the 2021 census as a matter of priority when these data become available.
The availability of a migration phenotype to identify migrants in CPRD GOLD will enable the study of important public health topics, such as primary care utilisation and sexual and reproductive health outcomes, in a large cohort of migrants, thus contributing essential evidence to the migration health field. It also provides a framework for further phenotyping work to study migrant health in other primary care databases. Results from this and any future phenotyping work can then be used to inform the development and implementation of policies that promote equitable healthcare for international migrants presenting to primary care.

Conclusions
We used a migration phenotype to identify a large cohort of the UK migrant population and demonstrated the feasibility of using CPRD GOLD to undertake large-scale population-based migration health research in the UK. This will allow researchers and policymakers to use primary care EHRs to monitor health outcomes and healthcare in migrants for evidence-based action. However, migrants were under-recorded in the CPRD GOLD database compared to ONS population estimates, particularly in older age groups who may have been in the country longer. Migrants in CPRD GOLD were largely representative of the UK migrant population in terms of sex and geographical region of birth. Improvements in the recording of migration status in CPRD were also observed over time.
Supplementary Materials: The following are available online at https://www.mdpi.com/article/10.3390/ijerph182413304/s1, Table S1: Code list to create 18 categories of ethnicity, and a 6-category higher level grouping of ethnicity, Table S2: Migration phenotype, including Read (V2) codes, Read terms, Medcodes, type of code and level of certainty of migration status, Table S3: Number and percentage of recorded migrants in CPRD GOLD per year by certainty of migration status (1997-2018), Table S4: Demographic characteristics of recorded migrants in CPRD GOLD at the time of the 2011 census by certainty of migration status (England and Wales), Table S5: Migrants as a percentage of all patients registered in a CPRD GOLD practice region (2009-2018), Table S6: Percentage of individuals recorded as migrants in CPRD GOLD and ONS country of birth (2004-2018), Table S7: Percentage of CPRD patients recorded as migrants and ONS 2011 census percentage of migrants in the population by age band (2011), Table S8: Age breakdown of migrants in CPRD GOLD and in ONS at the time of the 2011 census, Table S9