Sociodemographic and Genetic Influences on Dietary Patterns and Their Influence on Health Outcomes in the Atlanta Center for Health Discovery and Well Being Cohort

Diet influences, and is influenced by, a wide range of socioeconomic, cultural, geographic, and genetic variables. Here we survey a matrix of such interactions as well as their connection to a variety of health outcomes, in a cohort of 689 diverse adults employed at Emory University and enrolled in the Center for Health Discovery and Well-Being (CHDWB) study. Principal component analysis (PCA) of the Block Food Frequency Questionnaire revealed seven PC cumulatively explaining 25.8% and each individually at least 2% of the proportional consumption of 110 food items. PC1 is strongly correlated with the Healthy Eating Index-2015 measure, and accordingly healthier scores associate with multiple measures of physical and mental health. It, as well as PC2 (likely a measure of food expense) and PC3 (carbohydrate versus protein consumption) show significant geographic structure across the Atlanta metropolitan area, correlating with race and ethnicity, income level, age and sex. Notably, a polygenic score for body mass index (BMI) consisting of 281 SNPs explains 2.8% of the variance in PC5, which is as strong as its association with BMI itself. PC5 appears to differentiate participants with respect to conscious eating behavior related to the choice of diet or comfort foods. Our analysis adds to the growing literature on factor analysis of socio-demographic influences on nutrition and health.


Introduction
While it is well established that obesity and metabolic disease are mediated in part by total food intake, and the basic components of a healthy diet are well-known [1,2], rates of conformity to healthy diet recommendations differ widely across populations. Variation in diet has also been suspected as one of the leading mechanisms mediating the relationship between socioeconomic status and health outcomes [3,4]. Reciprocally, genetics and health-related factors also contribute to dietary choice [5,6]. Much remains to be learned about the distribution of dietary patterns across different socio-demographic, genetic and health spectra as well as the relative effect of these variables on dietary preference.
physiological, psychological and lifestyle profiles for all participants of a longitudinal health promotion program [22]. The participants were generally healthy and active employees without uncontrolled chronic disease conditions, drawn at random from all sectors of Emory University, including a breadth of social backgrounds and ethnicities as indicated in Table 1. Sociodemographic information was collected at recruitment using an electronic Personal Information Form. This included gender, race/ethnicity, age (computed from date of birth and visit date), household income in ordinal levels, educational attainment in years of schooling completed, marital status, and zip code of residence. At each visit over a four-year period (at 6-month intervals between the first three visits and 12-month intervals thereafter), participants were asked to complete a web-based Block FFQ. We analyze data for a total of 689 participants who reported total daily caloric intake in the range of 700-4200 kcal. 
They also had body composition measurements, blood was drawn for a comprehensive metabolic and immunological profile, and a range of surveys of health-related behavior facilitated computation of the Beck Depression Index (BDI) [23], General Anxiety Disorder-7 score (GAD-7) [24], Perceived Stress Scale (PSS-14) [25], Epworth Sleepiness Scale (ESS) [26], and the SF36 Quality of Life Survey [27]. Here we report on only the first visit since completion of the survey was variable at subsequent time-points, though the major PC of diet remain similar in a dataset including 2552 surveys. Additional details and analysis of health outcomes are provided in our previous publications [18,19,22].

Dietary pattern analysis
The dietary questionnaire used in the study was a version of the semi-quantitative FFQ-2005 administered over the internet by NutritionQuest. It included 110 food items with specified serving sizes described in natural portions (e.g. 1 banana) or standard weight and volume measures of common servings. For each food item, participants indicated the intake frequency and number of portions per intake based on 7-day recall. Daily consumption of each food was calculated by multiplying weekly intake frequency with number of portions consumed, divided by 7 days. The questionnaire thus returned a matrix of food consumption data, along with software-generated dietary and nutritional measurements that align with the 2015-2020 USDA Dietary Guidelines for Americans [28]. Some examples of the dietary and nutritional measurements are: caloric intake per day, cup equivalents of whole fruits consumed per 1,000 kcal, and percent of energy that comes from saturated fats and from added sugar [9].
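As a minimal sketch of the daily consumption calculation described above (weekly intake frequency multiplied by portions per intake, divided by 7 days), the following Python snippet may help. The function name and the example responses are hypothetical and are not taken from the NutritionQuest software.

```python
def daily_servings(times_per_week: float, portions_per_intake: float) -> float:
    """Average daily servings of one food from a 7-day recall."""
    return times_per_week * portions_per_intake / 7.0

# Hypothetical responses: food -> (times eaten in the past week, portions each time)
responses = {"banana": (7, 1.0), "pizza": (2, 1.5)}

# One row of the food consumption matrix for a single participant
daily = {food: daily_servings(freq, portions)
         for food, (freq, portions) in responses.items()}
```

Applying this to every participant and every one of the 110 items would yield the participant-by-food matrix referred to in the text.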
We computed HEI-2015 [10] from the dietary and nutrition summary data generated by NutritionQuest. Some variables required for the computation of HEI were not present in our FFQ dataset and were thus excluded or replaced by other variables. "Milk, including soy milk (cups)" was not present and the total dairy category was represented by "total cup equivalents of milk, yogurt and cheese" only. Similarly, "non-juice fruits" was replaced by "solid fruits", representing whole fruits. "Lean meat from soy products, excluding soy milk" was replaced by "lean meat from total soy products". Additional details of the HEI calculation and the FFQ variables used can be found on the NCI (https://epi.grants.cancer.gov/hei/) and WHI (https://www.whi.org/researchers/data/Pages/Available%20Data.aspx) websites.
The proportional food consumption for the entire study cohort is summarized in Figure 1, in which the 110 food items are grouped into 23 categories based on an established food classification method [16] that reduces the total number of items to be analyzed while retaining much of the variety. As might be expected, the largest consumption was observed for fruits, sweets and desserts, refined grains, and meats (processed and red). Weekly consumed amounts of each food item were calculated by multiplying the intake frequency by the number of portions per intake. Missing food frequency and quantity data were imputed using the expectation-maximization method [29]. Initial analysis indicated that the dietary variation was mainly driven by the total amount of food consumed, rather than the proportions of each food. While this effect is interesting and relevant to the influence of psychosocial stress on health, for the purposes of this study we considered it a bias to be overcome. Consequently, we transformed the food amounts into their relative proportions in each person's diet by dividing each food amount by the sum of all the food amounts. These values, after standardization into z-scores, were used as the input to Principal Component Analysis (scikit-learn version 0.19.1, Python). Although the percent variance explained by each of the major PCs decreased slightly relative to PCs generated with non-proportionalized data, the contributions of different items to each factor were more spread out and the scores more normally distributed. The Kaiser criterion suggested 36 significant components. However, since most of these showed no obvious dietary associations and were thus not helpful for later analysis, we instead examined the scree plot, which suggested a cutoff at variance explained > 2%, and seven principal components were retained. These cumulatively explained 25.8% of the variation in inferred dietary proportions in the cohort.
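The preprocessing steps above can be illustrated with a short, standard-library Python sketch: conversion of amounts to within-person proportions, z-scoring of each food across participants, and the > 2% variance-explained retention rule. This is not the analysis code used in the study. The function names and toy data are hypothetical, and the use of the population standard deviation for the z-scores is an assumption. The PCA itself is omitted here; in scikit-learn it would typically be run with `sklearn.decomposition.PCA`, whose fitted `explained_variance_ratio_` attribute could be passed to the retention rule.

```python
from statistics import mean, pstdev

def to_proportions(amounts):
    """Divide each food amount by the participant's total across all foods."""
    total = sum(amounts)
    return [a / total for a in amounts]

def zscore_columns(rows):
    """Standardize each food (column) to mean 0 and unit population SD."""
    standardized_cols = []
    for col in zip(*rows):
        mu, sd = mean(col), pstdev(col)
        # A food with no variation carries no information; map it to 0.
        standardized_cols.append([(v - mu) / sd if sd > 0 else 0.0 for v in col])
    return [list(row) for row in zip(*standardized_cols)]

def retained_components(explained_variance_ratio, threshold=0.02):
    """1-based indices of components explaining more than `threshold`."""
    return [i + 1 for i, r in enumerate(explained_variance_ratio) if r > threshold]

# Hypothetical weekly amounts: 3 participants x 3 foods.
amounts = [[2, 1, 1], [4, 2, 2], [1, 1, 2]]
proportions = [to_proportions(row) for row in amounts]
z = zscore_columns(proportions)
```

In the toy data the second participant eats twice as much as the first but in the same proportions, so both receive identical rows after the transformation, which is the intended removal of the total-amount effect.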
Geographic projection was performed using the "leaflet" package (https://rstudio.github.io/leaflet/) in R. Zip codes were combined into 23 zones based on geographical proximity, division of census tracts, similarity of neighborhoods in terms of sociodemographic profile and grocery store density, so that the number of individuals in each zone was roughly equal.

Polygenic score assessment
Genotyping of 423 individuals in our sample was performed on genomic DNA extracted from whole blood samples using either the HumanCoreExome-12 v1.1 or HumanOmniExpress-12 v1.1 Illumina genotyping arrays [30]. Imputation was performed using IMPUTE v2 software [31] with 1000 Genomes data, resulting in 8,242,192 imputed SNPs. A polygenic score for BMI (PGSBMI) was calculated using the linear scoring function in PLINK v2.0 [32] with reference GWAS data accessed from the EBI GWAS catalog (https://www.ebi.ac.uk/gwas/publications/30108127). The score includes 281 of the 289 SNPs reported in [33], with association p-values ranging from 2 × 10⁻²¹⁰ to 9 × 10⁻⁶. BMI and prevalence of obesity, defined as BMI >= 30, were plotted against polygenic scores for the 404 participants with all necessary data available, to see whether these SNPs correlate with the BMI trait in our cohort. Simple linear regression was performed to test whether PGSBMI has significant effects on the major principal components of dietary variation. Similarly, PGSWHR was derived for 402 people from 307 of the 316 independent SNPs in [34], with p-values ranging from 5 × 10⁻¹⁸³ to 5 × 10⁻⁹, accessed at https://www.ebi.ac.uk/gwas/studies/GCST008996. Obesity in this case was defined as WHR >= 0.9 for males, or WHR >= 0.85 for females.
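For readers unfamiliar with linear polygenic scoring, the sketch below shows the general form of such a score as a weighted sum of effect-allele dosages, together with the sex-specific WHR obesity definition given above. It is an illustration under stated assumptions and does not reproduce PLINK's implementation or its normalization options. The SNP identifiers, weights and dosages are invented.

```python
def polygenic_score(dosages, weights):
    """Weighted sum of effect-allele dosages over SNPs present in both inputs.

    dosages: SNP id -> effect-allele dosage (0..2) for one individual
    weights: SNP id -> per-allele effect size from the reference GWAS
    """
    shared = dosages.keys() & weights.keys()
    return sum(weights[snp] * dosages[snp] for snp in shared)

def is_obese_by_whr(whr, sex):
    """WHR-based obesity: >= 0.9 for males, >= 0.85 for females."""
    threshold = 0.9 if sex == "male" else 0.85
    return whr >= threshold

# Hypothetical data: three scored SNPs, one of which was not genotyped.
weights = {"snp_a": 0.02, "snp_b": -0.01, "snp_c": 0.05}
dosages = {"snp_a": 2, "snp_b": 1, "snp_d": 0}
score = polygenic_score(dosages, weights)
```

Restricting the sum to SNPs present in both inputs mirrors the situation described in the text, where 281 of 289 (or 307 of 316) reported SNPs were available for scoring.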

Statistical Analyses
Statistical analyses were performed in JMP Pro 14.3 (SAS Institute, Cary, NC). The distributions of the first 3 diet principal components were assessed and described by the sociodemographic characteristics (i.e. gender, age, race/ethnicity, education, income, marital status and zone of residence). Differences in the means of the PCs between genders were assessed by two-tailed t-test assuming equal variance. Differences in PC distributions among levels of other categorical/ordinal variables were first assessed by one-way ANOVA. For variables that have an intrinsic linear nature (i.e. age group, household income level, education level), we then used the order of the categories as "dummy numerical variables" to perform linear regression. In order to further investigate the relative effect sizes of the sociodemographic characteristics on the PCs, we then performed multivariable linear regression with the 6 variables described above. Associations between health and diet were measured by Pearson correlation between each PC and each continuous measure of physical, metabolic or mental health. ANOVA was used to evaluate associations with clinical illness by categorizing the participants into 6 health groups (obese, hypertensive, diabetic, and combinations thereof, as well as controls).
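The two simplest analyses described above, Pearson correlation and regression on ordered categories coded as numbers, can be sketched in plain Python as follows. The study itself used JMP Pro, so this is only an illustration. The function names, category labels and data values are hypothetical. For a simple linear regression with one predictor, the proportion of variance explained equals the squared Pearson correlation, which is what `r_squared` returns.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def r_squared(x, y):
    """Variance explained by a simple linear regression of y on x."""
    return pearson_r(x, y) ** 2

def ordinal_codes(levels, order):
    """Replace ordered category labels by their rank (1, 2, 3, ...)."""
    rank = {level: i for i, level in enumerate(order, start=1)}
    return [rank[level] for level in levels]

# Hypothetical income categories and PC scores for five participants.
income = ["low", "mid", "high", "mid", "low"]
pc_scores = [-0.9, 0.1, 1.2, 0.3, -0.4]
income_numeric = ordinal_codes(income, order=["low", "mid", "high"])
variance_explained = r_squared(income_numeric, pc_scores)
```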

Principal Components of Dietary Proportions
The food items that load most strongly onto the first seven principal components are listed in Table 2. Each PC captures different aspects of overall diet that we subjectively classify into vegetarian (PC1), expensive (PC2), high carb (PC3), soups (PC4), juices or typical diet foods (PC5), and fish-based diets (PC6), with corresponding negative loadings for unhealthy Western food items, inexpensive processed goods, high protein, breakfast, commonly consumed items, and red meat, respectively, while PC7 is more difficult to categorize. PC1 in particular might alternatively be conceptualized as capturing a healthy diet including a large proportion of fruits and vegetables. PC1 also has a significant linear relationship with the Healthy Eating Index-2015 (R² = 0.354, p < 0.0001), further corroborating PC1's representation of the healthy versus fast food eating axis.
food eating tendencies, with PC1 and PC2 as exemplars. Broadly speaking, PC1 tracks with wealth, being higher in the more affluent regions of Midtown, Decatur (near-east of Atlanta), and the upper and middle class suburbs of Roswell and Marietta. PC2 is markedly divided between north and south Atlanta, likely reflecting access to fresh and more expensive foods in the north, and higher prevalence of food deserts in the south. This distinction also tracks with the historical segregation of Atlantans by ancestry. All five PC are highly significantly differentiated by region (ANOVA, p < 10⁻⁵). Orthogonally, we also performed regression analysis to evaluate dependence of the three largest PC, each of which explains over 2% of the food item variance, on each of the social factors gender, self-reported race and ethnicity, age group, education level, household income, and marital status. 
Age was grouped by 10-year intervals, education level was categorized as high school or less (6-12 yrs), some college or college graduates (13-16 yrs), graduate school (17-22 yrs) or post graduate school (>22 yrs), wealth was binned in $25,000 or $50,000 increments as shown, and marital status was categorized as single, married or divorced by excluding 8 widowed individuals. The bins were assigned increasing numerical values for evaluation of the significance of the regression, with the exception of race/ethnicity which was evaluated by ANOVA. Salient results are presented in Table 3.
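A minimal sketch of the binning described above is given below, assuming the education cut-points stated in the text and assuming that the 10-year age intervals start at multiples of ten (the exact interval boundaries are not given). The function names and the integer codes are hypothetical.

```python
def education_level(years_of_schooling):
    """Ordinal education code from years of schooling completed.

    1: high school or less (up to 12 yrs), 2: some college or college
    graduate (13-16 yrs), 3: graduate school (17-22 yrs), 4: post graduate
    school (> 22 yrs).
    """
    if years_of_schooling <= 12:
        return 1
    if years_of_schooling <= 16:
        return 2
    if years_of_schooling <= 22:
        return 3
    return 4

def age_group(age_years):
    """Lower bound of the 10-year interval containing the age (assumed bins)."""
    return (int(age_years) // 10) * 10

# Hypothetical participants: (age, years of schooling)
participants = [(34, 12), (47, 16), (52, 18), (61, 24)]
coded = [(age_group(a), education_level(e)) for a, e in participants]
```

The resulting integer codes are the kind of increasing numerical values that can then be entered into a regression to test for a linear trend.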
Overall, dietary patterns varied along the socioeconomic gradient and, of the 7 tested variables, self-reported race partitioned them most strongly (P < 0.0001 for PC1-5). The healthy-eating PC1 was observed to be higher in females (mean PC1Female = 0.41, mean PC1Male = -0.78) and in Asians (2.58 vs 0.01 for European and -0.62 for African American), and generally increased for participants with higher education level or higher household income. A particularly strong gradient by income was also observed for PC2, confirming the inference from the geographic analysis.
Multivariate analysis indicated that gender dominates the association with PC1, but age as well as race and ethnicity also contribute independently (p < 0.0001 for each category). Furthermore, race/ethnicity, gender and income level have significant independent influences on PC2, whereas only race/ethnicity and age are associated with PC3 in the multivariate analysis. Neither education level nor marital status was significant when analyzed alongside the other variables.
Note. ***P < .0001, **P < .001, *P < .05

Health correlates with dietary principal components
Preprints (www.preprints.org) | NOT PEER-REVIEWED | Posted: 7 July 2020 doi:10.20944/preprints202007.0106.v1
The relationship between dietary patterns and health was investigated by correlating the main dietary principal components with the participants' clinical profiles. A total of 45 health related traits were examined, including 3 physical traits, 19 items from a comprehensive metabolic panel (CMP), 8 items from the complete blood count (CBC) test, 4 mental health related measures and 8 overall summary scores from the SF36 Quality of Life Survey. Pearson linear and Spearman rank correlations gave very similar results, and we report the Pearson correlations for 15 traits with particularly strong correlations to specific PC in Table 4, in which significant positive correlations are shaded red and negative ones blue. Most notable in this analysis is the strong correlation between PC1 and most health measures, indicating the expected positive impact of a healthier diet in general as well as on specific aspects of physical and mental well-being including vitality. Note that the negative correlations arise because larger values of the sleep, anxiety and depression scores, weight and blood pressure, and basal metabolic rate are associated with poor health. Perhaps surprisingly, PC2, which captures a more expensive diet, is also negatively correlated with BMI and body fat percent, and an association of a high carb diet (PC3) with reduced body weight was seen. Consumption of inexpensive processed foods, implied by low values of PC2, is very clearly associated with elevated waist-to-hip ratio, and mildly with mental health concerns. Notably, and also unexpectedly, the high fish diet implied by PC6 strongly correlates with high body fat percent and BMI. These results are consistent with diet being a major contributor to chronic disease. 
To further investigate this, we next performed a categorical analysis designed to evaluate whether obese individuals (BMI >= 30 kg/m2), hypertensives (blood pressure greater than 140/90 mmHg), and diabetics (mean blood sugar over 126 mg/dL) had abnormal PC scores. These clinical conditions are all pharmacologically controlled in the study participants. Joint incidence of all three conditions was observed in 15 individuals, and for one or two of the conditions in one third of the samples, leaving approximately two-thirds of the CHDWB cohort classified as relatively healthy controls. Figure 3 shows marked differences in the first three PC with respect to these three chronic conditions, with effects in the expected direction. Healthy status associates with high values of PC1 and PC3, but there was surprisingly little differentiation with respect to PC2. Individuals with all three conditions tend to have the most extreme dietary consumption patterns. We caution against over-interpretation of individual comparisons due to small sample size of some categories and presence of confounding variables.
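The categorization used for this analysis can be sketched as a simple labeling function based on the three thresholds given above. This is an illustration only. The function name and example values are hypothetical, and treating either an elevated systolic or an elevated diastolic reading as hypertension is an assumption, since the text states only "greater than 140/90 mmHg". How the resulting labels were merged into the six health groups used for the ANOVA is not specified here.

```python
def health_group(bmi, systolic, diastolic, mean_glucose):
    """Label a participant by the chronic conditions whose thresholds are met.

    Thresholds follow the text: BMI >= 30 kg/m2, blood pressure above
    140/90 mmHg (either reading, by assumption), mean blood sugar above
    126 mg/dL. Participants meeting none are labeled "control".
    """
    found = []
    if bmi >= 30:
        found.append("obese")
    if systolic > 140 or diastolic > 90:
        found.append("hypertensive")
    if mean_glucose > 126:
        found.append("diabetic")
    return "+".join(found) if found else "control"

# Hypothetical participants: (BMI, systolic, diastolic, mean glucose)
examples = [(24.0, 118, 76, 90), (32.5, 150, 85, 131), (30.0, 140, 90, 126)]
labels = [health_group(*row) for row in examples]
```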

Polygenic association of BMI with health-conscious dietary preferences
In order to assess whether genetic variation for body weight might act through dietary preference, we evaluated polygenic scores (PGS) for BMI and WHR using 281 and 307 independent SNPs, respectively, with weights ascertained by the contributing studies [33,34]. Imputed whole genome genotypes were available for 410 individuals, and despite the small sample size, the expected positive correlations between PGSBMI and BMI (Figure 4A) and prevalence of obesity (Figure 4B) were clearly observed. The polygenic score explains approximately 3% of the variance for BMI after excluding a handful of individuals with extreme BMI over 40. Each point in Figure 4B represents 41 individuals and there is a clear trend for increasing proportion of obese individuals as the PGS decile increases.
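The decile summary described above (410 individuals split into ten groups of 41) can be sketched as follows. This is a hypothetical helper, not the code used to draw the figure, and the toy data are invented. It assumes at least as many individuals as bins and places any remainder in the top bin.

```python
def prevalence_by_quantile(scores, obese_flags, n_bins=10):
    """Proportion of obese individuals within each score quantile bin.

    Individuals are ranked by polygenic score and split into n_bins
    consecutive groups of equal size; leftovers go into the last bin.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    size = len(order) // n_bins
    prevalences = []
    for b in range(n_bins):
        start = b * size
        end = (b + 1) * size if b < n_bins - 1 else len(order)
        members = order[start:end]
        prevalences.append(sum(obese_flags[i] for i in members) / len(members))
    return prevalences

# Toy data: 20 individuals whose obesity status rises with the score.
toy_scores = list(range(20))
toy_obese = [i >= 15 for i in range(20)]
by_decile = prevalence_by_quantile(toy_scores, toy_obese)
```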
There does not appear to be any association between polygenic predisposition to BMI and dietary PC1 (Figure 4C) or any of the other PC, with the marked exception of PC5 (Figure 4D), where the regression explains 2.8% of the variance (p = 0.0005), being approximately as strong as the association with BMI. Consideration of the loadings on PC5 suggests that high values reflect consumption of fresh juices, diet shakes and other items typically consumed by dieting persons, whereas negative values might be related to more filling foods like ice cream, soft drinks and tacos. The negative association of PC5 with PGSBMI is consistent with the interpretation that polygenic risk for obesity is mediated through health-conscious eating behaviors and satiety. No associations with PGSWHR were observed.

Discussion
This study investigated the correlation of four major types of non-dietary factors with people's diet, namely geography, socioeconomic standing, health status, and genetics. Few studies have considered combinations of these subjects together, despite the general acceptance of the concept that diet is one of the major factors that connects SES with health. Our results are entirely consistent with the supposition that a healthy diet strongly associates with education, income, and access to quality food, with highly significant health outcomes as argued by [35,36]. In the context of Atlanta, a cosmopolitan city with a large African American population, it is also evident that these cultural factors are confounded with race and ethnicity. Our results regarding the geographic distribution of dietary patterns are also consistent with descriptions of unique features related to so-called food deserts, where access to food is dominated by dollar stores, gas station food marts or fast food establishments. It is evident that food access and environment have a major effect on residents' diet choices, but confounding with racial and SES disparities makes it difficult to parse specific contributions.
In contrast to the conventional dietary analytic approach focusing on a single nutrient or a summary score, in this study a regression-based methodology based on principal components was applied to food proportions computed from dietary surveys. We show that this allows association of specific aspects of dietary choice with other variables. The PCs identified with our approach represent the most significant dimensions of eating behavior, and were specifically designed to capture proportions of consumed food items rather than overall consumption amounts. The reproducibility and validity of this approach were initially discussed by [7]. We note that recent machine learning approaches may reveal stronger and more consistent clusters of dietary patterns, and that their utility in nutrition research is just beginning to be tapped [37].
The present data reveal significant relationships between dietary patterns and health status, supporting a link between healthy eating habits and well-being. Intuitively, what we eat can affect our physical health, and this is apparent in the trend for people having a balanced diet with plenty of fruits and vegetables being at lower risk for diet-related disease and "healthier". However, health status, or the perception of health status, is likely to reciprocally impact dietary choice as well: one example is that people with diabetes tended to consume less sugary food than the remainder of the cohort. It is in general difficult to ascribe the direction of causality to any of the described relationships.
One caveat of the study was that the Block FFQ is only semi-quantitative, is biased by self-recall of eating habits, and does not survey subtle but important distinctions such as types of salad dressing and kinds of cooking oil. It does include calibration questions needed for computation of nutritional intake calculations, but we elected instead to focus on food group proportions and so these did not contribute to the analyses. Nevertheless, the approach is used widely in the nutrition literature and is accepted to capture broad trends in food consumption.
Recently, a number of large-scale genome-wide association studies have begun to attribute genetic factors to BMI, WHR and obesity [33,34] as well as to patterns of dietary consumption. A GWAS of 85 single food intake traits and 85 principal components of diet in FFQ data from the UK Biobank [38] identified 136 associations specific to dietary choices such as white versus wholemeal/wholegrain bread consumption. Many of these link to olfactory receptor associations, for example with fruit and tea intake, but Mendelian randomization failed to adduce strong evidence for a causal role in coronary artery disease or diabetes. We provide preliminary evidence that one component of dietary intake, PC5, which possibly captures a measure of health-conscious eating, is significantly correlated with a polygenic score for BMI. This finding is consistent with the enrichment of neuronally expressed genes in the BMI GWAS loci and the notion that this PGS mediates its effect in part through the propensity to diet. Our results also indicate how important it will be to control genetic analyses for cultural and socioeconomic confounders, which are major mediators of dietary behavior.