The Demographic, Social, and Economic Correlates of HIV Infection Status in Sub-Saharan Africa

Key points: In a continent-wide search correlating 8,980 variables with HIV+ in >600,000 individuals from 29 countries across Sub-Saharan Africa, we identified hundreds of social determinants, such as education and marital status, whose associations are comparable to established causal factors.
Preprints (www.preprints.org) | NOT PEER-REVIEWED | Posted: 21 December 2020 | doi:10.20944/preprints202012.0507.v1


Introduction
The number of new HIV infections in Sub-Saharan Africa has been declining for over a decade, but the prevalence of HIV has not declined: over 1 million people are newly infected every year, and over 5 million are unaware of being HIV-positive. [1,2] The global strategy for reducing the population burden of HIV calls for intensive identification of HIV-infected individuals in order to pursue high treatment coverage, which, in turn, is anticipated to reduce HIV transmission and incidence. [1] Sub-Saharan Africa, where the burden of HIV is highest, remains short of these major targets, including identifying at least 90% of those with HIV. [1] One challenge hindering progress in identifying those with HIV infection is the heterogeneity of HIV risk in terms of modes of transmission and prevalence. In other words, it is hard to know where and whom to target for testing and public health interventions when aiming for very high (>90%) identification rates. This heterogeneity is manifest in the proliferation of inquiries into HIV risk factors, without unifying approaches or systematic studies into HIV risk. [3][4][5][6][7] At the time of this writing, no systematic reviews of HIV risk factors in Africa have been published. In this paper we describe a systematic approach to the identification of HIV risk factors across multiple populations and geographic regions of Africa. The goal of this analysis is to help identify groups that may benefit from HIV-specific public health efforts.
The epidemiology of risk factor identification commonly proceeds by identifying candidate risk factors.
These approaches are grounded in real-world observations but suffer from several limitations: generalizing to populations other than those directly examined is often tenuous; testing one or several candidate risk factors may spuriously identify candidate risk factors; [8,9] and a limited number of candidate risk factors may leave important relationships unobserved. [10] We address these shortcomings using a large-scale risk factor testing approach across the entire population of individuals who had an HIV test as part of the Demographic and Health Surveys (DHS) between 2003 and 2017 in Sub-Saharan Africa, a population of over 600,000 individuals, extending our previous work to identify factors associated with HIV in Zambian females. [11] We test every variable or candidate risk factor available in DHS against individual-level HIV outcomes, a total of 32,353 variables (27,506 in females and 24,713 in males). This approach allows us to systematically identify variables that generalize well as risk factors for HIV and that play an important role in explaining who does and does not have HIV, over multiple survey waves across regions of Sub-Saharan Africa, in men and women separately.

Data Sources
The primary data for this work come from the Demographic and Health Surveys (DHS). The DHS are nationally representative surveys conducted approximately every 5 years in many low- and middle-income countries. We used data from every individual who had consented and tested positive or negative for HIV-1. [12] There were a total of 50 surveys available for analysis for females and males each across 29 countries.

Selection and preparation of variables per survey
We retained all variables that had at least 90% complete data, a total of 29,092 and 25,980 unique variables in female surveys and male surveys, respectively, and 33,729 unique variables overall in both males and females (see Supplementary Information).

Survey-Specific Association Models and Meta-Analysis Across Sub-Saharan Africa
We associated each of the variables with HIV status using a weighted logistic regression model:
$$\operatorname{logit} P(\mathrm{HIV}_{ij} = 1) = \alpha_{jk} + \beta_{jk} X_{ijk}$$
where $j$ indexes the survey (including country, year, and female or male survey; e.g., Zimbabwe, 2005, female survey), $i$ indicates the individual observation (all analyses are person-level analyses), and $k$ indexes the variables for that survey. We estimated the Nagelkerke pseudo-R^2 to assess the improved goodness of fit from a logistic model with zero variables (equivalent to the prevalence of HIV) to a model with $X_{ijk}$. [13] We used a survey-weighted logistic regression model to account for the probability-based sampling of DHS, implemented in the survey package in R. [14] Last, for each variable, we combined associations across all of the surveys (year and country combination) for males and females with a random effects meta-analysis procedure, estimating the average association and the heterogeneity of the associations across the surveys and countries (see Supplementary Information).
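The two summary quantities described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' pipeline, which used the survey package in R: the function names are hypothetical, the Nagelkerke formula takes log-likelihoods as given (with survey weights these would be pseudo-log-likelihoods), and the DerSimonian-Laird estimator is an assumption, since the text specifies only "a random effects meta-analysis procedure".

```python
import math

def nagelkerke_r2(ll_null, ll_model, n):
    """Nagelkerke pseudo-R^2 from the log-likelihoods of the intercept-only
    model (ll_null) and the one-variable model (ll_model), n observations."""
    cox_snell = 1.0 - math.exp(2.0 * (ll_null - ll_model) / n)
    max_cox_snell = 1.0 - math.exp(2.0 * ll_null / n)  # upper bound of Cox-Snell
    return cox_snell / max_cox_snell

def random_effects_meta(betas, ses):
    """Pool per-survey log odds ratios (betas) with standard errors (ses)
    using the DerSimonian-Laird random-effects estimator (assumed here)."""
    k = len(betas)
    w = [1.0 / se ** 2 for se in ses]               # inverse-variance weights
    sw = sum(w)
    fixed = sum(wi * b for wi, b in zip(w, betas)) / sw
    q = sum(wi * (b - fixed) ** 2 for wi, b in zip(w, betas))  # Cochran's Q
    df = k - 1
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0  # between-survey variance
    w_star = [1.0 / (se ** 2 + tau2) for se in ses]
    pooled = sum(wi * b for wi, b in zip(w_star, betas)) / sum(w_star)
    se_pooled = math.sqrt(1.0 / sum(w_star))
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0  # heterogeneity, %
    return {"beta": pooled, "se": se_pooled, "or": math.exp(pooled),
            "tau2": tau2, "i2": i2}

# Made-up per-survey estimates, for illustration only.
example = random_effects_meta([0.1, 0.9], [0.1, 0.1])
```

In this sketch, I^2 is derived from Cochran's Q, so surveys whose estimates disagree by much more than their standard errors would predict yield values close to 100%.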
We prioritized variables that were the most explanatory and whose associations were statistically nonzero across the surveys after correction for multiple hypotheses. Specifically, we report variables whose pan-sub-Saharan Africa meta-analytic p-value was lower than a conservative DHS-wide Bonferroni threshold of 1x10^-6 (for 7,251 plus 6,288 variables, a Bonferroni threshold would be 0.05/13,539, or p < 3.7x10^-6) and whose average R^2 across the surveys was in the top 25% of all R^2 for males and females, which was equivalent to an R^2 of 0.001.
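The threshold arithmetic and the two-part selection rule can be checked with a short Python sketch. The variable counts and cutoffs are taken from the text; the function name `is_identified` is hypothetical and is not part of the authors' code.

```python
# Variable counts reported in the text.
n_female = 7251
n_male = 6288
alpha = 0.05

n_tests = n_female + n_male          # 13,539 tests in total
bonferroni = alpha / n_tests         # nominal Bonferroni threshold
reporting_threshold = 1e-6           # the stricter cutoff used for reporting

def is_identified(p_value, mean_r2,
                  p_cutoff=reporting_threshold, r2_cutoff=0.001):
    """Apply both criteria: meta-analytic p-value below the cutoff and
    average Nagelkerke R^2 across surveys above the cutoff."""
    return p_value < p_cutoff and mean_r2 > r2_cutoff

print(f"{n_tests} tests, Bonferroni threshold = {bonferroni:.2e}")
```

Because the reporting cutoff of 1x10^-6 is below the nominal Bonferroni value, any variable passing it also passes the nominal correction.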
For full transparency, a browsable version of all findings is available on a web application here: https://www.chiragjpgroup.org/dhs_hiv_meta/. We have placed all of our summary statistics, including the overall and country-specific meta-analytic odds ratios, R^2, I^2, standard errors, and p-values in tables located here: https://github.com/chiragjp/dhs_hiv_meta/blob/master/meta_data/Online%20Supplementary%20Tables.xlsx?raw=true. See Supplementary Information for more information.

Cohort Characteristics
We harmonized 50 DHS surveys from 29 African countries conducted between 2003 and 2017, in which 619,468 participants were tested for HIV (47% male, 6.1% positive) (Table 1). Although we analyzed only surveys with HIV prevalence at or above 1% of the adult population, prevalence still ranged from 1% to 25% (Table 1). As described in the methods, there were 7,251 and 6,288 unique variables in females and males respectively (Supplementary Figure 1AB).
Distribution of associations across Sub-Saharan Africa points to consistency in direction but variability in size of associations.
Overall, we found robust but heterogeneous associations of social, environmental, behavioral, and biological variables with HIV+ across the 29 countries of sub-Saharan Africa (Figure 1A-C, Supplementary Figure S2 [for single country]). In the following, "identified variables" are those that had a Bonferroni-level p-value less than 1x10^-6 and a Nagelkerke R^2 greater than 0.001. Each of the estimated overall and country-specific odds ratios, standard errors, and p-values is documented in extensive online content (see Supplementary Information).

Variables identified across sub-Saharan Africa
We identified 344 (5.4% out of 6,288 possible) and 373 (5.1% out of 7,251 possible) variables in their association with HIV+ (R^2 greater than 0.001 and p-value less than 1x10^-6) in males and females respectively. For variables that were surveyed among female participants in 11-19 countries (a total of 432 variables), we identified 35 variables (8.1%). Second, for variables that were surveyed among female participants in 20-29 countries (a total of 449 variables), we identified 90 variables (20%). Third, in the male sample, we identified 38 (2.6% out of 1,450), 34 (10% out of 334), and 36 (11.7% out of 307) variables. A total of 168 and 154 variables were assessed in all 29 countries for females and males. Of these, we identified 31 (18%) and 16 (10%) in females and males respectively. Among the 31 identified variables assessed in all 29 countries in females, the median R^2 was 2.9x10^-3, the median odds ratio was 1.57, and the median I^2 was 69.43%. For the 16 identified variables assessed in all 29 countries in males, the median R^2 was 2.5x10^-3, the median odds ratio was 1.39, and the median I^2 was 60.6%. The distribution of R^2 was comparable for variables assessed in different numbers of countries (Figure 1A), while variables with extreme odds ratios became rarer as the number of countries in which they appeared increased (Figure 1B).
Variables broadly exhibited high heterogeneity in their odds ratios across countries (Figure 1C), with a majority having I^2 greater than 50%.
We highlight several variables that show a striking relationship with HIV+ (Figures 2 and 3). The variable indicating women who are the head of the household was significantly associated with HIV in nearly 60% of study countries, explained 1% of the variation in HIV status on average (and nearly 4% in two countries), and was uniformly positively associated with HIV, with a meta-analytic odds ratio of 2.5 (2.5-fold increased odds of HIV+ relative to females who were not the head of the household; min-max 1.1-3.5) across all countries (Figure 2A, first row). The analogous indicator, head of household among men, was identified in 40% of countries, explained as much as 6% of the variability in HIV status, and was also positively associated with HIV+ status (Figure 2B, 2nd row). The heterogeneity (I^2) in the odds ratios for females and males was 75%. In other words, there was substantial variability in the odds ratios across countries.
Other indicators of marital status were associated with HIV+ across the subcontinent, including the number of unions, if the participant had been married, and if they were divorced among females.
First, females who had been in one union had 60% decreased odds of being HIV+ across all countries in the subcontinent, and this variable was identified in slightly less than 40% of the countries (range of the Nagelkerke R^2 was 0.1% to 4%, Figure 2A, 4th row). Second, women who were divorced had consistently increased odds (median OR 2.4) of HIV+ relative to those not divorced. This was consistent throughout the study countries (I^2 estimate of 33%).
Males who were divorced had an inconsistent association with HIV+ (I^2 of 99%) across countries.
Specifically, in 13 out of 28 countries divorced males had a near zero chance of HIV+ (Figure 2B). Third, males who were in a union with (or living with) a woman at the time of the survey had an average odds ratio of 2.2, a greater than 2-fold increase in the odds of HIV+ relative to those who were not, and this was consistent across the study countries, with Nagelkerke R^2 reaching up to 4% (Figure 2B, 5th row). Of note, marital status also stood out in our previous analysis of HIV+ in Zambia. [11]
Complex indicators of education status emerged as a key correlate in numerous countries. For example, we report a meta-analytic OR of 4.4 across 29 countries for the female participant not attending school (Figure 2A). While this indicator may overlap with young age, low educational attainment was also associated with higher risk of HIV among men (Figure 2B). Similarly, the meta-analytic OR across 28 countries for males who completed primary school versus those who did not was extremely small, 0.004 (Figure 2B, 8th row).
We found several variables that reflected potential biological and co-morbid conditions with HIV+ (Figure 3AB). For example, men who were not anemic had an overall 60% decreased odds of HIV+ (with Nagelkerke R^2 up to 4%). This variable was identified in 25% of the countries in which it was measured (Figure 3B).

Prediction of HIV Status
The use of the top 10 variables improved prediction of HIV status relative to use of prevalence rates for risk prediction across every country in the study (Figure 4AB, see Supplementary Information for difference between AUC and PRAUC). For some countries (e.g., eSwatini) the predicted probability of HIV risk was fairly evenly distributed across the population (Gini=0.15) (Figure 5). On the other hand, the Gini coefficient was upwards of 0.5 in several countries (including Ethiopia, Niger, and females in Rwanda) (Figure 5).
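To make the Gini coefficient concrete, the following Python sketch computes it for a list of predicted probabilities. It is an illustration under stated assumptions: the text does not describe how the coefficient was computed, the function name `gini` is hypothetical, survey weights are ignored, and the input values are made up.

```python
def gini(values):
    """Gini coefficient of non-negative values (0 = perfectly even,
    approaching 1 = concentrated in very few individuals)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted sum over the sorted values (ranks start at 1).
    weighted = sum((rank + 1) * x for rank, x in enumerate(xs))
    return 2.0 * weighted / (n * total) - (n + 1.0) / n

# Made-up predicted risks for two hypothetical populations.
even_risk = [0.10, 0.10, 0.10, 0.10]          # everyone shares the same risk
concentrated_risk = [0.0, 0.0, 0.0, 1.0]      # all risk sits in one person

print(gini(even_risk), gini(concentrated_risk))
```

Under this definition, a low value such as the 0.15 reported for eSwatini corresponds to predicted risk spread fairly evenly across individuals, while values above 0.5 indicate that a small share of individuals carries most of the predicted risk.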

Discussion
Here, we use a data-driven approach for discovery of candidate risk factors for HIV at scale, and in doing so, present extensive correlates of HIV infection in Sub-Saharan Africa that span economic, biological, and environmental domains. Finally, we apply this approach for HIV risk prediction across the sub-continent, and show that risk assessment of HIV status can be improved using this non-invasive epidemiologic approach.
Several identified risk factors deserve further discussion because of their strong association with HIV status, their pervasive presence across the study countries, and their relatively large explanatory power. Being the head of the household stands out for identifying relatively large risk of HIV among females in the majority of our study countries. This is an intuitive correlate of risk, and possibly identifies women whose husbands had HIV and passed away, but its relative ubiquity has not been previously recognized. Women in the complementary groups (those living in households with a male head and those who have been in a single marriage) are at lower risk of having HIV. More generally, variables that characterize marital status are closely associated with HIV status.
Another important identified variable group relates to schooling and educational attainment among both males and females. Not attending school was a consistent correlate of increased risk among females, while completion of primary school was a protective correlate among men. Because school and educational attainment indicators are often closely correlated with one another, additional schooling correlates would be identified if our heuristic did not eliminate them (for example, the one-hot-encoded complement of attending school among females). Finding educational correlates that are in line with causal effect estimates among our top variables supports the importance of schooling in reducing HIV risk. [15]
Several less intuitive variables were also commonly associated with HIV. Non-ownership of livestock is a consistent positive correlate of HIV among both males and females. We believe that this may be more closely associated with residence in urban environments (non-ownership of agricultural land also carries elevated risk of HIV among both males and females, in fewer surveys). Notably, we do not observe urban residence among the top candidate risk factors for HIV, suggesting livestock ownership carries additional (indirect) linkages to HIV risk.
A notable advantage of our approach is its systematic assessment of the associations of all the variables assessed in DHS, creating for the first time a database of robust associations (e.g., reproducible association sizes across multiple waves) across all DHS-surveyed sub-Saharan African countries. In comparison to "candidate" association studies that examine a handful of associations at a time, a systematic approach may lead to more consistent positive, as well as negative, identification of important correlates. This highlights the importance of examining plausible pathways linking the identified variables to HIV: the consistency of the variables we identify lends them statistical strength, and that strength further benefits from real-world context.
A systematic approach also provides a database to contextualize associations. Male circumcision, originally identified in observational studies, is associated with a meta-analytic relative risk of 0.52 (an absolute value RR of roughly 2). [7] This association size would rank at the 80th percentile of pan-sub-Saharan Africa associations available in this report. In other words, ~1,257 variables would meet this threshold among all variables assessed in men, and only a fraction of those are robust across the continent (Figure 1B). Further, we emphasize that the DHS aims to collect a representative sample of the populations surveyed, and we anticipate little sample selection bias in the database of correlates of HIV+.
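The arithmetic behind this comparison can be reproduced with a short Python sketch. The relative risk (0.52), the percentile (80th), and the number of variables assessed in men (6,288) come from the text; the helper name `abs_effect` is hypothetical, and placing a relative risk on the same scale as the odds ratios in this report is a simplifying assumption.

```python
import math

def abs_effect(ratio):
    """Direction-free effect size exp(|log(ratio)|), so that protective and
    harmful associations of equal strength map to the same value."""
    return math.exp(abs(math.log(ratio)))

rr_circumcision = 0.52
abs_rr = abs_effect(rr_circumcision)      # equals 1 / 0.52

n_male_variables = 6288
share_at_or_above = 1.0 - 0.80            # variables at or above the 80th percentile
n_at_or_above = share_at_or_above * n_male_variables

print(f"absolute-value RR = {abs_rr:.2f}; "
      f"variables at or above = {n_at_or_above:.0f}")
```

Taking the absolute value of the log ratio before exponentiating is what allows a protective factor such as circumcision to be ranked alongside risk-increasing factors.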
Another strength of this large multi-country person-level analysis is the demonstration of HIV risk distribution in different populations. For example, we can measure the concentration of risk in different populations. We show large variation in HIV risk distribution between countries: HIV risk is highly concentrated among a small portion of the population in some countries, while risk is more distributed in the population in other countries. If all risk is concentrated in a small portion of the population, then identifying at-risk groups may be more feasible from a public health perspective than if risk is more evenly distributed across a large portion of the population. We also note that, in most countries, HIV risk is more concentrated in females than in males, and that HIV risk concentration is very high (upwards of 60%) in a handful of countries (Chad, Ghana, Togo, and Ethiopia).
There are limitations to our data-driven study. First, the associations that we identify may be confounded. Second, we applied a stringent heuristic (Bonferroni p-value and pseudo-R^2 thresholds) for identification of associations from a massive database. Therefore, some associations may be "false negatives" and fail to be discussed. We have provided all associations in the Supplementary Tables so that readers can examine each association in the context of the others. Third, our correlations are cross-sectional and we cannot rule out potential reverse-causal relationships. Fourth, aside from the HIV test, variables are self-reported and may have differential error rates. As we reported, [11] random error or non-differential bias will lead to reduction of association sizes toward the null.
In this extensive analysis of risk factors for HIV positivity, we are able to systematically characterize HIV risk factors, including identification of under-appreciated risk factors with meaningful association sizes, as well as the extent to which risk factors are shared across countries and over time. This approach can be used to better characterize HIV risk using observational data, to identify hypotheses for HIV interventions, and to serve as a platform for developing tools for identifying risk of important non-HIV outcomes.

Table Legends
Table 1. Sample sizes across sub-Saharan Africa
Table 2. Distributions of odds ratios, Nagelkerke R^2, and I^2 (heterogeneity) estimates across countries.
Table S1. Distributions of odds ratios, Nagelkerke R^2, and I^2 (heterogeneity) estimates for variables appearing in 1 country and at least 2 countries.
Figure S2. Empirical CDF of (A) Nagelkerke R^2, (B) exp(absolute value(beta)) or OR, and (C) Heterogeneity (I^2). Red line depicts CDF for those not identified, blue line identified (e.g., p-value < 1e-6 and R^2 > 0.001).