Correlation between the Bilingual Status and the Onset Age of AD and MCI Subjects: Evidence from the ADNI dataset

Background: This paper investigates the statistical relationship between bilingualism and the Onset Age (OA) of Alzheimer's Disease (AD) and Mild Cognitive Impairment (MCI) in a clinical sample of 580 AD subjects and 1264 MCI subjects retrieved from the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset.
Method: To investigate whether bilingualism has any correlation with the OAs of AD or MCI subjects, our study leverages the ADNI dataset, which covers both the OA and the information needed to infer the bilingualism status of the AD and MCI subjects. Prior to performing any meaningful statistical analysis, a regression model and a probabilistic model were developed in parallel to fill in the missing OA and bilingualism values. A simple least-squares regression model with a single independent variable, the age registered for the Mini-Mental State Examination (MMSE) score, was used to estimate the OA of the AD and MCI subjects in the ADNI dataset. After filling in the missing OA values, the number of subjects relevant for the statistical analysis increased from 816 (AD: 371, MCI: 445) to 1844 (AD: 580, MCI: 1264), which greatly enlarged the representation of the AD and MCI sample in the ADNI population. With the increased sample size, a novel probabilistic classification model was introduced to infer an ADNI subject's bilingualism when relevant demographic information and a deterministic outcome were not readily available from the ADNI dataset. The weighted average OA for the bilinguals and the monolinguals was then computed, where the weights for the probabilistic labels were assigned based on the percentage of bilingualism in the general US population. Finally, a statistical analysis was performed to test whether any statistically significant correlation exists between the OA and the bilingualism of the AD and MCI subjects within the ADNI dataset.
Findings: Our preliminary study finds no statistically significant difference between the OA of the bilinguals and the monolinguals within the ADNI dataset. Thus, the monolingual speakers within the ADNI dataset do not statistically manifest earlier onset than the bilingual speakers, which is slightly inconsistent with some earlier statistical findings that bilingual speakers enjoy certain distinctive advantages, such as later onset of AD, compared with their monolingual counterparts.

Keywords: Alzheimer's Disease, Onset Age, Bilingualism, Cognitive Reserve, Dementia, Mild Cognitive Impairment, ADNI database

Significance Statement: This paper seeks to overcome a limitation of previous studies, namely small sample size. By making appropriate assumptions, data from 580 AD and 1264 MCI subjects were interrogated to investigate multiple AD and linguistic data from the ADNI dataset. This study also provides a way to manage non-deterministic linguistic outcomes to facilitate more rigorous statistical analysis. Most importantly, our preliminary statistical study, which investigates the correlation between the OA of AD/MCI and bilingualism based on ADNI, a large clinical dataset, shows no conclusive distinctive advantage for bilinguals over monolinguals in terms of delayed AD/MCI onset. Thus, more in-depth investigation might be

Introduction
Dementia is a general description for a set of symptoms associated with the deterioration of cognitive abilities such as an individual's episodic memory, verbal skills, reasoning etc., that can gradually affect one's ability to perform daily activities. According to the WHO, it is estimated that around 50 million people are affected by dementia worldwide, with this figure increasing at a rate of 10 million new cases per year, and creating a substantial economic burden of almost a trillion dollars (WHO, 2020).
Alzheimer's Disease (AD) is an irreversible and progressive neurodegenerative disease which accounts for up to 70% of cases of dementia, and therefore represents a high priority for the development of effective interventional strategies. Approaches that aim to prevent neurodegenerative disorders such as AD and Mild Cognitive Impairment (MCI) can be generally divided into primary, secondary and tertiary prevention, according to the different stages of disease development (Fratiglioni et al., 2007). This study focuses on primary prevention by identifying protective factors that may decrease or delay the development of AD.
Several studies (Alladi et al., 2013; Bak et al., 2014; Bialystok et al., 2014) provided evidence that bilingualism may be a contributing factor that helps to defer the onset of symptoms of AD.
This study makes use of evidence from one of the largest AD datasets, the ADNI dataset, to investigate the claim that bilingualism possesses preventive effects against neurodegenerative disorders and delays the OA of AD and MCI.

Cognitive Reserve
The concept of cognitive reserve was proposed by Stern (2009) to explain the discrepancy between the degree of brain pathology and clinical manifestations. Previous studies reported that 25% of elderly individuals who performed normally during neuropsychological tests were found to meet the full pathologic criteria for AD, indicating that some people can cope better than others despite similar degrees of brain damage. Cognitive reserve is believed to provide a level of resistance to neurological damage, possibly as the result of increased synaptic plasticity, compensatory use of alternate brain areas, or enriched brain vasculature (Fratiglioni et al., 2004). This prompts us to look for factors contributing to cognitive reserve that may help postpone the onset of symptoms of AD and MCI.

Defining Bilingualism
Previous studies argued that various environmental factors may affect the onset age of AD. Fratiglioni et al. (2004) suggested that an active and socially integrated lifestyle in late life may confer protective effects against AD. Scarmeas et al. (2001) found that participation in leisure activities decreases the likelihood of incident dementia and may provide a level of cognitive reserve that delays the onset of symptoms of dementing diseases. Fratiglioni et al. (2007) further investigated potential risk factors for AD throughout the lifespan, and argued that socio-economic factors, including educational level and occupation, as well as life habits, may also affect the risk of AD.
In addition to the features discussed above, there is growing evidence that bilingualism may also defer the onset of the symptoms of AD and MCI. Bialystok et al. (2007) [...] education. Their study indicated that bilingual AD patients possess significantly higher levels of cerebral atrophy in areas closely related to AD, supporting the claim that bilingualism leads to increased cognitive reserve. Evidence from studies on cerebral glucose metabolism in MCI and AD also supports a role for bilingualism in increasing cognitive reserve, reporting that bilingual patients have more severe brain changes than monolinguals when adjusting for severity of cognitive impairment (Kowoll et al., 2016).

Methodology
After reviewing previous studies investigating the effect of bilingualism on AD, we identified the key factors to be taken into account in the statistical analysis (see Table 1). The commonly used confounding variables include education, MMSE score, occupation, lifestyle, ethnicity, race, and language proficiency. Since most of the clinical studies were carried out on relatively small samples (n < 200), our study intends to leverage extensive AD databases and see whether the pattern holds true over larger sample sizes. In particular, we utilized the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset, which is one of the most comprehensive, precise and readily available AD datasets at present (Mueller, 2005).
Nevertheless, the ADNI dataset has neither a label for bilingualism nor explicit information about subjects' proficiency in their primary and secondary languages. Thus, we propose a methodology for probabilistic bilingual classification that consists of three steps (see Figure 1). First, we selected AD and MCI subjects from the ADNI dataset and extracted the information relevant to our study, including OA and demographic information. Second, we filled in the missing OA values with a regression model, and assigned a probabilistic classification to subjects not explicitly determined to be bilinguals or monolinguals. Third, based on the pre-processed data, we examined whether bilingualism has any correlation with the OAs of AD or MCI subjects.

Note to Table 1: Partial correlation coefficients. * The bilingualism composite factor was calculated by performing a PCA using the following variables: years of language exposure to both languages for speaking; self-rating of language proficiency in both languages for speaking, comprehension, writing, and reading; the percentage of language usage; and frequency of language switching.

Subject Selection and Variable Definition
We selected a total of 1844 subjects (AD: 580, MCI: 1264) from the ADNI dataset. We then extracted the relevant features from the dataset for statistical analysis (Table 2).

Figure 1 (methodology steps):
• Step 1: Data collection (ADNI dataset); subject selection (AD: 580, MCI: 1264); variable definition (onset age (OA) and bilingualism).
• Step 2: Filling in missing OA values using first visit age; bilingualism probabilistic classification.
• Step 3: t-test using bilingualism probabilities as weights; OA difference between bilinguals and monolinguals.

[...] (1) and (2)), so the diagnostic age definition is used.
Bilingualism (independent variable): According to Hamers et al. (2000), definitions of bilingualism range from native-like competence in both languages to minimal proficiency in a second language. To prevent any confusion, this study defines bilingualism as having at least minimal proficiency in a second language. The ADNI dataset does not indicate the bilingual status of individuals. However, it includes information on the primary language and the tested language (the language in which the subject was tested when admitted to the ADNI clinical trial). We take subjects whose primary language differs from their tested language as bilinguals. For the remaining subjects, we adopt a probabilistic classification approach to determine the subject's bilingualism (see Section 3.3.2 for more details).

[...] to predict the subject's OA. Figure 2 shows the results of the fitted regression model for estimating the OA based on the first visit age. In the model, first visit age is a significant predictor of the OA (p < 0.001). Moreover, R² > 0.999 indicates that the regression model explains most of the variance in OA using the first visit age alone.

To assign a probabilistic label to a participant's bilingualism based on his or her ethnicity, we use the ethnicity and language characteristics of the US population from US census data as our baseline (United States Census Bureau, 2019). Table 3 shows the statistics describing language use at home in the US population by ethnicity. [...] Table 4, to observe whether the weighted average OA will be consistent. After going through one of the two scenarios in Table 4, each subject was assigned either one probabilistic label (probability of zero or one) or two probabilistic labels (probabilities of being bilingual and monolingual).
For example, if a subject is Hispanic, two entries will be created in the dataset, one with a bilingual label and one with a monolingual label, with probabilities of 0.719 and 0.281, respectively. If a subject's primary and tested languages are different, then we determine that this subject is bilingual with probability one, and so only one entry will be created. Finally, the average AD/MCI OA weighted by probability can be computed for bilingual entries and monolingual entries separately.

Table 4 (two scenarios for assigning probabilistic labels; table values shown in parentheses):

Scenario 1:
• If the subject's primary language is different from his/her testing language, then the subject is replaced with one entry, with a bilingual label and probability one (1, 0, 100%).
• Else if (1) the subject's primary language and testing language are both English and (2) his/her race is Black American or White American, the subject is replaced with one entry, with a monolingual label and probability one (0, 1, 100%).
• Else, the subject is replaced with two entries, using information based on the 2019 US census data, specifically the percentage of the US population speaking English only:
o One with the bilingual label and the bilingual probability conditional on his/her ethnicity
o Another with the monolingual label and the monolingual probability conditional on his/her ethnicity

Scenario 2:
• If the subject's primary language is different from his/her testing language, then the subject is replaced with one entry, with a bilingual label and probability one (1, 0, 100%).
• [...] Non-Hispanics [...]
• Else, the subject is replaced with two entries, using information based on the 2019 US census data, specifically the percentage of the US population speaking English only:
o One with the bilingual label and the bilingual probability conditional on his/her ethnicity
o Another with the monolingual label and the monolingual probability conditional on his/her ethnicity
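The pre-processing described in this section (filling in missing OA values from the first visit age, expanding each subject into probabilistic entries, and averaging OA by probability) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The `Subject` field names, the string values for language and race, and the function names are all hypothetical, and the rule order follows the first rule list quoted from Table 4. The only probability taken from the text is the Hispanic bilingual probability of 0.719; any other ethnicity would need a value from the census table.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Subject:
    # All field names are hypothetical; they are not ADNI column names.
    first_visit_age: float
    onset_age: Optional[float]  # None when the OA is missing
    primary_language: str
    tested_language: str
    race: str
    ethnicity: str


def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def impute_onset_age(subjects: List[Subject]) -> None:
    """Fill missing OA values in place, using first visit age as predictor."""
    known = [s for s in subjects if s.onset_age is not None]
    intercept, slope = fit_line(
        [s.first_visit_age for s in known],
        [s.onset_age for s in known],
    )
    for s in subjects:
        if s.onset_age is None:
            s.onset_age = intercept + slope * s.first_visit_age


def label_entries(
    s: Subject, p_bilingual: Dict[str, float]
) -> List[Tuple[str, float]]:
    """Return (label, probability) entries for one subject."""
    if s.primary_language != s.tested_language:
        return [("bilingual", 1.0)]
    if s.primary_language == "English" and s.race in {"Black", "White"}:
        return [("monolingual", 1.0)]
    p = p_bilingual[s.ethnicity]  # census-based probability for the ethnicity
    return [("bilingual", p), ("monolingual", 1.0 - p)]


def weighted_mean_oa(
    subjects: List[Subject], p_bilingual: Dict[str, float]
) -> Dict[str, float]:
    """Probability-weighted mean OA for bilingual and monolingual entries."""
    totals = {"bilingual": [0.0, 0.0], "monolingual": [0.0, 0.0]}
    for s in subjects:
        for label, p in label_entries(s, p_bilingual):
            totals[label][0] += p * s.onset_age
            totals[label][1] += p
    return {k: num / den for k, (num, den) in totals.items() if den > 0}


# Toy data for illustration only; these are not ADNI records.
subjects = [
    Subject(70.0, 68.0, "English", "English", "White", "Non-Hispanic"),
    Subject(80.0, 78.0, "Spanish", "English", "White", "Hispanic"),
    Subject(75.0, None, "Spanish", "Spanish", "Other", "Hispanic"),
]
impute_onset_age(subjects)
print(weighted_mean_oa(subjects, {"Hispanic": 0.719}))
```

In this toy example the third subject has no recorded OA, so it is estimated from the fitted line, and because the subject falls through to the last rule it contributes to both group averages with weights 0.719 and 0.281. The regression needs at least two subjects with known OA and distinct first visit ages.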

Statistical Analysis on the Correlation between OA and Bilingualism
We tested the statistical correlation between OA and bilingualism using a two-sided t-test. More specifically, we adopted Welch's t-test, which does not assume that the two groups have equal variance. Welch's t-test evaluated the mean difference between the bilingual and the monolingual groups (probabilistic entries), using their corresponding probabilities as the weights. The null hypothesis is that there is no OA difference between the bilinguals and the monolinguals. A p-value greater than 0.05 is considered not statistically significant, meaning that the null hypothesis of no OA difference cannot be rejected.
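A weighted version of Welch's t-test can be sketched as follows. The text does not specify how the probability weights enter the variance and the degrees of freedom, so this sketch assumes one common convention: each weight is treated as a fractional count, so that the group size is the sum of the weights. The function names are hypothetical, and the p-value uses a normal approximation to the t distribution, which is only reasonable when the degrees of freedom are large.

```python
import math
from statistics import NormalDist
from typing import Sequence, Tuple


def weighted_stats(
    values: Sequence[float], weights: Sequence[float]
) -> Tuple[float, float, float]:
    """Weighted mean, sample variance and group size.

    Assumption: weights act as fractional counts, so the group size is the
    sum of the weights and the variance uses (size - 1) in the denominator.
    """
    n = sum(weights)
    mean = sum(w * x for x, w in zip(values, weights)) / n
    var = sum(w * (x - mean) ** 2 for x, w in zip(values, weights)) / (n - 1.0)
    return mean, var, n


def weighted_welch_t(
    x1: Sequence[float],
    w1: Sequence[float],
    x2: Sequence[float],
    w2: Sequence[float],
) -> Tuple[float, float, float]:
    """Return (t statistic, Welch-Satterthwaite df, approximate two-sided p)."""
    m1, v1, n1 = weighted_stats(x1, w1)
    m2, v2, n2 = weighted_stats(x2, w2)
    se1, se2 = v1 / n1, v2 / n2  # squared standard errors of the two means
    t = (m1 - m2) / math.sqrt(se1 + se2)
    df = (se1 + se2) ** 2 / (se1 ** 2 / (n1 - 1.0) + se2 ** 2 / (n2 - 1.0))
    # Normal approximation to the t distribution (assumes large df).
    p = 2.0 * (1.0 - NormalDist().cdf(abs(t)))
    return t, df, p
```

Here `x1` and `w1` would hold the OA values and bilingual probabilities of the bilingual entries, and `x2` and `w2` those of the monolingual entries. With all weights equal to one, the statistic and degrees of freedom reduce to the ordinary Welch's t-test. For small groups, an exact t-distribution p-value from a statistics library would be preferable to the normal approximation.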

Results
The statistical results for the aforementioned two scenarios with p-values are listed in Table 5.
In general, the p-values range from 0.43 to 0.89 for AD subjects and from 0.15 to 0.93 for MCI subjects. No statistically significant OA difference is observed at the 0.05 or 0.01 level across all scenarios and all population groups (including Hispanics and Non-Hispanics).

[...] assuming that the diagnosis provided in the ADNI study is reliable.
Bilingualism classification for the elderly population: In this study, there is a mismatch between the ADNI population and the US census population. More specifically, the census data across all age groups was used to estimate the bilingual ratio among Hispanics and non-Hispanics. In the future, a more detailed breakdown of the US census data, including age and ethnicity data, could be used to derive the bilingualism probability for the elderly population.
The degree of bilingualism: In this study, the degree of bilingualism was not taken into account. The US census data provide two labels to determine the degree of bilingualism (for people who speak another language, such as Spanish, at home): speaking English very well and speaking English less than very well. In the future, we will evaluate whether the effects of bilingualism on OA differ across the three levels of bilingualism (i.e., English only, speaking another language and speaking English less than very well, and speaking another language and speaking English very well).
The effects of bilingualism on certain cognitive tasks: In this study, we only investigated the relationship between bilingualism and OA. However, bilinguals may have an advantage only on certain tasks. The ADNI dataset provides four composite measures, each representing a different aspect of cognitive ability: ADNI-EF (executive functioning), ADNI-MEM (memory), ADNI-LAN (language), and ADNI-VS (visuospatial functioning). In the future, we will examine the effects of bilingualism on ADNI-LAN and other cognitive abilities.
Confounding bias: In this study, the statistical association between OA and bilingualism may be biased due to uncontrolled confounding factors. Further studies that account for the important confounding factors, such as education, occupation, lifestyle, physical activities, and diet, are needed. For example, analysis of covariance (ANCOVA) can be performed by controlling for occupation and education. These factors can be converted into quantitative indicators, e.g., using a 5-category scale according to the International Standard Classification of Occupation for occupation history (International Labor Office, 2012).

Conclusions and Policy Recommendations
Our study explored the relationship between the OA of AD/MCI patients and their level of bilingualism using ADNI, a comprehensive AD/MCI database, from which 1844 AD/MCI subjects were selected. We proposed a novel methodology to predict missing OA values and level of bilingualism with a linear regression model and a probabilistic classification model, respectively, providing an alternative way to manage non-deterministic linguistic outcomes and facilitate more rigorous statistical analysis. Based on our statistical model, the difference in OA between monolingual and bilingual AD/MCI patients is not statistically significant (p > 0.05). This suggests that more in-depth studies investigating the effects of other confounding factors, such as education and occupation levels, as well as more extensive linguistic data collection from AD/MCI patients, are required.
While ADNI is already one of the most extensive and comprehensive AD/MCI databases available worldwide, the fact that the subjects' level of bilingualism is yet to be provided by ADNI suggests that there is a general lack of awareness that linguistic features can be good indicators or markers of AD. Thus, there is a strong need for health policy-makers and research communities worldwide to collect more extensive linguistic data from the AD/MCI population, and for more rigorous studies to be conducted internationally and locally, to facilitate linguistic-based investigation of AD and evidence-based health policymaking for the vulnerable AD communities.