Governance and socioeconomic factors contributing to antimicrobial resistance in European countries: a data panel and machine-learning analysis

The aim of this work is to explain the behaviour of the multiresistance percentage of Pseudomonas aeruginosa in some European countries through a multivariate statistical analysis and machine learning validation, using data from the European Antimicrobial Resistance Surveillance System, the World Health Organization and the World Bank. First, we use a descriptive analysis and a principal components analysis, followed by k-means clustering to determine the countries and regions that are most affected by antibiotic resistance. Second, we expand the database by adding some socioeconomic, governance and antibiotic-consumption variables. We then run a data panel regression analysis to determine functions that relate the multiresistance percentage to those new variables. Finally, we use machine learning techniques to validate a pooling panel data case, using XGBoost and random forest algorithms. The results of the data panel analysis indicate that the most important variables for the multiresistance percentage are corruption control and the rule of law. Similar results are found with the machine learning validation analysis, where the human development index is an additional important variable for the multiresistance percentage.

Preprints (www.preprints.org) | NOT PEER-REVIEWED | Posted: 8 October 2021 doi:10.20944/preprints202110.0127.v1 © 2021 by the author(s). Distributed under a Creative Commons CC BY license.


Introduction
Antibiotics have completely revolutionized the world by prolonging human life.
However, human behaviour has led to antibiotic abuse, or their inappropriate use in clinical settings, which has led to increasing antibiotic resistance.
Antibiotic resistance is an ever-growing concern globally in medicine and public health. Patients infected by antibiotic-resistant bacteria require extended hospital stays and multiple costly treatments, which have an economic impact on both the patients and the healthcare system [1]. Several pathogens have started to develop antibiotic resistance, particularly to first-line inexpensive broad-spectrum antibiotics, while the introduction of new drugs (e.g. [...]). EARS-Net also claims that P. aeruginosa remains one of the major causes of healthcare-associated infection in Europe. Because of its ubiquitous nature and potential virulence, P. aeruginosa is a challenging pathogen to control in healthcare settings. Antimicrobial resistance is a complex and still poorly understood phenomenon, which includes well-described molecular mechanisms (antibiotic-mediated selection, horizontal gene transfer, and others) [9] fostered by different social and behavioural determinants that are still unknown.
Several studies have found that the general and imprudent consumption of antimicrobials is the main cause of antimicrobial resistance. However, other factors have been suggested, such as socioeconomic factors and corruption [10]. Some researchers claim that higher antimicrobial resistance rates can be found in low-income and middle-income countries, in which per-person consumption of antibiotics is much lower than in high-income countries [11,12,13]. In fact, the quality of governance and public spending on health, poverty, education, and community infrastructure are known to affect health outcomes [13].
We will include socioeconomic and governance variables such as total gross domestic product, gross domestic product for health, control of corruption (among others), and a variable that represents the consumption of antibiotics.
We use a descriptive analysis to determine the most important characteristics of the data. Then, a Principal Components Analysis (PCA) is used to establish the most influential antibiotics in the data. Next, we run a k-means clustering analysis by country to determine which countries are most affected by antibiotic resistance. We then use the data panel regression method to determine the functions that relate the multiresistance percentage to some socioeconomic, governance and antibiotic-consumption variables.
Finally, we validate the previous results using Machine Learning (ML) techniques for the pooling panel data case. We use several kinds of filters to keep the most important terms of a polynomial of degree up to six. We use a threshold value filter for the covariance between the target variables. We then use the XGBoost and random forest algorithms.

Materials and methods
In this section we describe the methods used in this study. Figure 1 shows the geographical distribution of the countries of the European regions that are used in this study.

Data collection
The data were collected from the European Antimicrobial Resistance Surveillance System [...]. We carried out a correlation analysis between the resistance percentages of P. aeruginosa to the antimicrobial groups under regular surveillance, including the resistance percentage to at least three antibiotics simultaneously (multiresistance percentage).
Using the R software, we first determined the confidence intervals (CI) for the resistance percentages to each antibiotic and we then computed the correlation matrix.
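The two steps described above can be sketched as follows. This is a minimal illustration in Python with NumPy rather than the R code used in the study, and it assumes a normal-approximation interval for the mean; the interval type actually used is not stated. The array `resistance` and the helper `mean_confidence_interval` are hypothetical.

```python
import numpy as np

def mean_confidence_interval(values, z=1.96):
    """Normal-approximation 95% confidence interval for the mean of a sample."""
    values = np.asarray(values, dtype=float)
    mean = values.mean()
    # Standard error of the mean, using the sample standard deviation (ddof=1).
    sem = values.std(ddof=1) / np.sqrt(values.size)
    return mean - z * sem, mean + z * sem

# Hypothetical resistance percentages: rows are country-year observations,
# columns are three antibiotic groups (illustrative values only).
resistance = np.array([
    [10.0, 12.0, 9.0],
    [20.0, 25.0, 18.0],
    [15.0, 16.0, 14.0],
    [30.0, 33.0, 27.0],
    [25.0, 24.0, 22.0],
])

# One interval per antibiotic group (column).
intervals = [mean_confidence_interval(resistance[:, j])
             for j in range(resistance.shape[1])]

# Pearson correlation matrix between the columns.
corr = np.corrcoef(resistance, rowvar=False)
```

The correlation matrix is symmetric with a unit diagonal, so only its upper or lower triangle needs to be reported.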

Principal Component Analysis
PCA was done only on the standardized resistance percentages to the five antibiotics (we did not consider the multiresistance percentage). The new variables were ordered by the percentage of the original variance that they described. A reasonable number of components should be retained to avoid an under-performing forecast. Cattell's scree test [14] was chosen as the criterion of relevance.
PCA was also done to discriminate the resistance percentage observations by region. We used a biplot-type graph (see Figure 2).
The variables that appeared in the dimensions for each antibiotic were identified, and only the first component (with an eigenvalue of 4.29, explaining 85.7% of the data variance) was considered.
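With five standardized variables the eigenvalues of the correlation matrix sum to 5, so an eigenvalue of 4.29 corresponds to 4.29 / 5 ≈ 0.86 of the total variance, in line with the reported 85.7%. A minimal sketch of this computation is given below; the function name `pca_from_correlation` and the synthetic data are assumptions made for illustration and are not the study's data or code.

```python
import numpy as np

def pca_from_correlation(data):
    """PCA on standardized columns via the eigen-decomposition of the correlation matrix."""
    data = np.asarray(data, dtype=float)
    # Standardize each column to zero mean and unit sample variance.
    z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)
    corr = np.corrcoef(z, rowvar=False)
    eigenvalues, eigenvectors = np.linalg.eigh(corr)
    # eigh returns ascending eigenvalues; reorder so the first component is the largest.
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues = eigenvalues[order]
    eigenvectors = eigenvectors[:, order]
    explained = eigenvalues / eigenvalues.sum()
    scores = z @ eigenvectors
    return eigenvalues, explained, scores

# Synthetic example: five strongly correlated columns driven by one common factor.
rng = np.random.default_rng(0)
common = rng.normal(size=(50, 1))
data = common + 0.3 * rng.normal(size=(50, 5))

eigenvalues, explained, scores = pca_from_correlation(data)
```

Plotting `eigenvalues` against the component index gives the scree plot on which Cattell's criterion is read.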

k-means clustering
Through k-means clustering, we analysed the correlation between the EU/EEA countries with respect to the resistance percentages of P. aeruginosa. k-means is a partitioning technique that groups the data into clusters, so that the objects within a cluster are similar to each other but different from the objects in the other clusters. A centroid-based partitioning method was applied, using the centroid of a cluster to represent that cluster.
During the application of the technique, one centroid was defined for each cluster, and the objective function to be minimized was the sum of squared errors [15], as follows:

$$E = \sum_{i=1}^{k} \sum_{p \in C_i} \lVert p - c_i \rVert^2,$$

where $E$ is the sum of squared errors for all objects in the dataset; $p$ is the point in space representing an object; and $c_i$ is the centroid of cluster $C_i$. The k-means clustering technique was applied to the resistance percentage data to identify how the countries are grouped in relation to the resistance percentages. The selection of the appropriate number of clusters is one of the most influential factors in the results of k-means clustering. The Mojena criterion was used to determine the optimal number of clusters [16]. Once the best number of clusters was determined, the k-means technique was applied to group the data.
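The sum-of-squared-errors objective can be illustrated with a small sketch of the standard Lloyd iteration. It assumes that the initial centroids are supplied by the caller; how the centroids were initialized in the study is not stated. The function `kmeans` and the toy points are hypothetical.

```python
import numpy as np

def kmeans(points, init_centroids, n_iter=100):
    """Lloyd's algorithm; returns labels, centroids and the sum of squared errors."""
    points = np.asarray(points, dtype=float)
    centroids = np.asarray(init_centroids, dtype=float).copy()
    for _ in range(n_iter):
        # Assign every object to its nearest centroid (Euclidean distance).
        dist = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty).
        updated = np.array([
            points[labels == i].mean(axis=0) if np.any(labels == i) else centroids[i]
            for i in range(len(centroids))
        ])
        if np.allclose(updated, centroids):
            break
        centroids = updated
    # Sum over all objects of the squared distance to the centroid of their cluster.
    sse = float(((points - centroids[labels]) ** 2).sum())
    return labels, centroids, sse

# Toy data: three well-separated groups of three points each.
points = np.array([
    [1.0, 1.0], [1.2, 0.9], [0.8, 1.1],
    [5.0, 5.0], [5.2, 4.9], [4.8, 5.1],
    [9.0, 1.0], [9.2, 0.9], [8.8, 1.1],
])
labels, centroids, sse = kmeans(points, init_centroids=points[[0, 3, 6]])
```

Because the result of the iteration depends on the starting centroids, in practice it is usually repeated from several initializations and the partition with the smallest sum of squared errors is kept.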

Panel data analysis
Panel (or longitudinal) data provide observations on cross-section units (e.g. individuals, firms, industries, countries, regions), $i = 1, 2, \ldots, N$, over repeated time periods, $t = 1, 2, \ldots, T$. The cross-section units will be referred to as units or groups. In this study, each of the EU/EEA countries was considered as a unit. However, it was difficult to obtain information for every unit over the whole period. Because of the amount of missing data, and to maintain the homogeneity of the data, the missing values were filled in by polynomial interpolation. Nevertheless, it was not possible to make up for the lack of data from the EARSS database (only with the resistance percentage information), which forced the construction of an unbalanced panel (i.e., one in which missing elements result in an incomplete data series for a unit, or units are absent in some years for some variable) with $N = 30$ and $T \in [8, 14]$. Slovakia was the country with the fewest time periods ($T = 8$), followed by Belgium ($T = 10$), while most countries (20) had 14 time periods. According to Hsiao [17], although many statistical proposals are built on the assumption of balanced panels, most empirical studies and the available data only allow for unbalanced panels, such as the one presented in this study.
We used the Fixed Effects method (FE). Thus, it is necessary to apply a transformation of the data to eliminate the unobserved heterogeneity, which allows the fixed effects estimator to take the form of an Ordinary Least Squares (OLS) estimator.
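The transformation mentioned above is commonly the within (demeaning) transformation: each variable has its unit-specific time average subtracted, which removes the unit effect, and OLS is then applied to the transformed data. The sketch below assumes this is the transformation meant; the function names and the toy unbalanced panel are hypothetical, and the study's own estimates were not produced with this code.

```python
import numpy as np

def within_transform(values, groups):
    """Subtract from each observation the time average of its own unit."""
    values = np.asarray(values, dtype=float)
    groups = np.asarray(groups)
    demeaned = values.copy()
    for g in np.unique(groups):
        mask = groups == g
        demeaned[mask] -= values[mask].mean(axis=0)
    return demeaned

def fixed_effects_ols(y, X, groups):
    """Fixed effects (within) estimator: OLS on the demeaned data."""
    y_w = within_transform(y, groups)
    X_w = within_transform(X, groups)
    beta, *_ = np.linalg.lstsq(X_w, y_w, rcond=None)
    return beta

# Toy unbalanced panel: unit "A" has three periods, unit "B" has two.
# The data follow y = alpha_i + 2 * x exactly, with alpha_A = 10 and alpha_B = -5.
groups = np.array(["A", "A", "A", "B", "B"])
X = np.array([[1.0], [2.0], [3.0], [2.0], [4.0]])
y = np.array([12.0, 14.0, 16.0, -1.0, 3.0])

beta = fixed_effects_ols(y, X, groups)
```

The unit intercepts never have to be estimated explicitly, which is why the approach remains usable when units have different numbers of time periods.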
Thus, we proposed FE models for the multiresistance percentage with the following structure: [...]

In the ML framework, we used models such as those used by [18,19]. We used the Scikit-learn package in Python, together with the Ordinary Least Squares method. We note that it is also possible to use the Statsmodels package of Python. We modeled the relationship between the input variables $x_1, x_2, \ldots, x_n$, called features in the ML framework, and the output variable, $y$, called the target variable, as a non-linear relationship given by a polynomial of degree $d$, called the complexity in the ML framework, in the variables $x_1, x_2, \ldots, x_n$: [...] Here, the variables $x_1, x_2, \ldots, x_n$ were the $n = 9$ variables given in (2). Although equation (3) is non-linear in the features, the estimation of the parameters is still a linear regression because equation (3) [...] (SHAP) package [20]. The latter package was used to make some plots to better explain the models. We used the following hyperparameters for XGBoost: $\eta$ = 0.3, max_depth = 3, subsample = 0.5, iterations = 10000.

Table 2 shows the 95% confidence intervals, the mean vector, the standard deviation vector and the coefficient of variation vector (CV) of the resistance percentage for each antibiotic. Table 3 shows the correlation matrix.

To corroborate these results, we used a k-means cluster analysis. We obtained three clusters, shown in Figure 3, which corroborated the results obtained through the PCA.

Figure 3: Clusters using the k-means method.

Tables 5 and 6 show the results of the panel data analysis after removing the coefficients that are not significant ($p$-value greater than 0.05).

Panel data analysis
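These results were extracted using EViews software. From the results given in Table 5, we obtain the following structure for the models:

$$\mathrm{R\_multi} = \bar{\beta}_1\,\mathrm{Gdp\_health} + \bar{\beta}_2\,\mathrm{Ctrl\_corrup} + \bar{\beta}_3\,\mathrm{Rule\_law} + \bar{\beta}_4\,\mathrm{DDD\_sys\_commun} + \bar{\beta}_5\,\mathrm{Per} + \bar{\beta}_6\,\mathrm{Out} + \bar{\beta}_7\,\mathrm{HDI} + C.$$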
Table 6 shows the [...] and [...] for the model. Table 7 shows the results of the validation tests of the models and Tables 9 and 10 [...] Year Effect ([...] that we must opt for a pooled OLS model, we opted to use ML without considering heterogeneity across time and countries. Figure 4 shows the results of the ML analysis. We note that we did not apply the polynomial features technique given in (3) [...], i.e., the number of terms of our polynomial (4) given by [...], which for $d = 2, 3, 5$ are 45, 165, [...], respectively. Therefore, after applying the threshold number, we should keep in mind empirically to select at most 134 terms.
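The counts 45 and 165 quoted above agree with the number of distinct monomials of exactly degree $d$ in $n = 9$ variables, $\binom{n + d - 1}{d}$, for $d = 2$ and $d = 3$. Reading the term count this way is an assumption, and the helper `n_terms_of_degree` below is hypothetical; the sketch only checks the arithmetic.

```python
from math import comb

def n_terms_of_degree(n_features, degree):
    """Number of distinct monomials of exactly `degree` in `n_features` variables."""
    # Combinations with repetition: choose `degree` factors among `n_features` variables.
    return comb(n_features + degree - 1, degree)

# Assumed setting: the n = 9 features of the model, degrees 2 and 3.
counts = {d: n_terms_of_degree(9, d) for d in (2, 3)}
# counts == {2: 45, 3: 165}
```

The rapid growth of this count with the degree is the practical reason for filtering the polynomial terms before fitting a model.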
Next, we present the results obtained with the Low Variance Filter. We eliminate the variables whose variance is less than a threshold value. Apart from the Linear [...]

Discussion
This study presented the complexity of the antibiotic resistance and multiresistance phenomena of P. aeruginosa for the EU/EEA countries. We proposed an approach that uses a multivariate statistical analysis, a data panel strategy and machine learning techniques. Addressing this problem cannot be limited to good control practices centred on health institutions alone, nor to private relationships between health personnel within institutions or to antibiotic self-prescription. We proposed a much broader scenario where the willingness of governments to distribute their resources in basic goods such as [...] Meanwhile, out-of-pocket expense is considered the main source of spending in low- and middle-income countries. This is also related to the lack of government contribution to public spending on health, so people must individually spend money on their health. This can be correlated with antibiotic resistance in two ways. First, the countries with the highest out-of-pocket expenses have a higher direct consumption of non-prescribed antibiotics from private entities such as drugstores or pharmacies, due to distrust of the health system and because people do not find guarantees of care. Second, because they occur more frequently in low-income countries, the unavailability of resources and drug cost overruns can lead people to stop using antibiotics without following the guidelines for proper use.
As we mentioned previously, this is the first study to include HDI as a study variable for antibiotic resistance. In the multiresistance model obtained, HDI is inversely related to antibiotic resistance; that is, the higher the human development in a country, the lower the antibiotic resistance. This is important given the characteristics of the indicator.
This indicator includes three components that must be taken into account: life expectancy, schooling, and total GDP. In the initial model, where GDP was included, it was not a significant variable for antibiotic resistance, and it was therefore not included in the final model. For this reason, the life expectancy and schooling components may be the ones affecting antibiotic resistance.
The analysis of the coefficients over time for the multiresistance model showed a particular pattern of positive influence on antibiotic resistance since 2014. In comparison, the previous years show a lesser influence (except for 2005). This may be due to multiple factors, such as the recent awareness of the problem of antibiotic resistance and greater reporting of information, in addition to geopolitical and economic problems that have mainly affected the countries of South-Eastern Europe and that can be related to changes in governance systems, trust in governments and spending on basic goods.
As a recommendation for future works, more studies are required on the effect of these variables, including other bacteria, to study the problem more broadly and to be able to generate alternatives to face this antimicrobial resistance pandemic of the twenty-first century.

Conclusions
In this work, we first used a multivariate statistical analysis to understand the behaviour of the multiresistance percentage of Pseudomonas aeruginosa to five antibiotics commonly used for its treatment, using data from the EARSS, the WHO and the World Bank. We used a descriptive analysis to determine the most important characteristics of the data. The results of these analyses show that there is a high positive correlation between the resistance percentages, which indicates a high linear dependence between the variables. This indicates that when the resistance percentage to one antibiotic increases, the resistance percentage to the other antibiotics is very likely to increase as well, and vice versa. Based on these results, a PCA was used to establish the most influential antibiotic in the data. Cattell's scree test was applied to determine the number of principal components needed to analyse the [...] Finally, the Jarque-Bera test for the residuals showed that they do not satisfy the assumption of normality. However, given the number of observations, and under a standard regression model subject to certain regularity conditions, the residuals will behave as asymptotically normal. Tables 8 and 9 show the effects of the coefficients discriminated by time and country. These results were consistent with the results obtained in the cluster analysis of Section 2.3.3, where it could be established that the countries that contribute the most to antibiotic resistance are those found mainly in the South-Eastern region of Europe, particularly Romania, Slovakia and Croatia, and the countries that contribute least to antibiotic resistance are the North-Western countries.
It is striking that Cyprus, being a South-Eastern country, is the one that contributes the least to antibiotic resistance. This may be because, being an island, it may have no neighbourhood effects, and its foreign relations are more common with countries such as Turkey and those of North Africa than with continental European countries.
We can observe from