1. Introduction
Per- and polyfluoroalkyl substances (PFAS) have garnered significant attention in recent years due to their presence in a wide range of consumer and industrial products and their persistence in the environment [
1]. However, the characteristics that make PFAS so prevalent in our society also underscore their challenges to human and environmental health [
2].
PFAS are a group of synthetic organic compounds characterized by their perfluoroalkyl chains, which consist of carbon atoms fully saturated with fluorine atoms [
3]. This unique chemical structure contains one of the strongest and most stable bonds in organic chemistry, the carbon-fluorine (C-F) bond, which is responsible for the exceptional resistance of PFAS to heat, chemical degradation, and biological breakdown, earning them the name "forever chemicals" [
4]. The fluorine atoms in PFAS molecules form a protective shield around the carbon backbone, rendering these compounds highly hydrophobic, which contributes to their utility in various industrial applications, such as the production of non-stick coatings and water-resistant textiles [
5]. However, this is also the reason behind their persistence in the environment and their ability to bioaccumulate in organisms [
6].
The hydrophobic properties of PFAS make them poorly soluble in water, which limits their dissolution in aqueous environments and allows them to persist in soil, water, and sediment for extended periods [
7]. Furthermore, this hydrophobicity disrupts normal metabolic pathways, as it partitions into fatty tissues rather than remaining in aqueous solutions [
8]. This phenomenon leads to bioaccumulation in organisms, as PFAS are absorbed through ingestion or absorption and accumulate in fatty tissues over time, resulting in elevated concentrations within organisms throughout the food chain [
9].
Over the past few years, a growing body of research has revealed health risks associated with PFAS exposure, including cancer, thyroid disorders, developmental anomalies in children, and immune system dysfunction [
10] [
11]. PFAS exposure has been associated with various cancers, in part because PFAS-induced oxidative stress plays a role in cellular damage and DNA mutations, contributing to cancer development [
12][
13]. Furthermore, studies have shown that by interfering with hormonal regulation systems and suppressing immune system function, PFAS can lead to disorders like hypothyroidism, developmental anomalies, such as stunted growth and delayed cognitive development in children, and neurotoxic effects [
14][
15].
Recent studies have also suggested that PFAS can disrupt fundamental physiological processes, particularly those of the kidneys [
16]. The kidneys, often called the body’s natural filtration and waste management system, play a pivotal role in maintaining homeostasis. Their intricate network of nephrons and tubules ensures the efficient removal of waste products, excess fluids, and electrolytes from the bloodstream [
17]. Consequently, the exploration of the relationship between PFAS exposure and kidney function has assumed a position of paramount importance in the realm of health research [
18]. Extensive research in recent years has shed light on the profound impact of PFAS on kidney health. When introduced into the body, PFAS compounds can infiltrate renal tissues, interacting with various cellular components and initiating molecular responses [
19]. These interactions can lead to structural changes in the kidneys, potentially altering the distribution of kidney types, which has been identified in previous studies. Such changes in kidney morphology have significant implications for kidney function and overall health [
20]. Moreover, studies have indicated that PFAS exposure can disrupt the finely tuned balance of hormonal regulation systems, potentially leading to disorders such as hypothyroidism, which can further affect kidney health [
21].
Therefore, this research aims to gain further insight into the relationship between PFAS and kidney function [
22]. Utilizing machine learning techniques, we aim to unravel the complexities of this relationship and shed more light on how PFAS accumulation may impact the intricate structure and function of the kidneys [
23]. Using a dataset containing PFAS chemical features and kidney parameters, we performed exploratory data analysis and dimensionality reduction with Principal Component Analysis (PCA) to identify patterns and correlations within the data. To verify the relationship between PFAS and the kidneys, an XGBoost Classifier was used to predict kidney type from PFAS descriptors. Next, an XGBoost Regressor was used to estimate PFAS accumulation in the organ, assessing the impact of PFAS on the kidneys. Finally, a Random Forest Regressor was developed to predict the ratio of Glomerular Total Surface Area to Proximal Tubule Volume to offer insights into kidney function. The models were trained on 70% of the dataset and evaluated using metrics such as R-squared, confusion matrices, Mean Absolute Error, and residual analysis. Hyperparameters were tuned using grid search with cross-validation.
2. Materials and Methods
2.1. Dataset
The dataset used in this study is available at [24]. It provides three groups of information. The first is the chemical features of PFAS, such as lipophilicity, vapor pressure, water solubility, and other chemical descriptors. The second is information about the animal studied, such as its species and sex. The third is physiological features, including the animal’s body mass and characteristics of the kidney such as the diameter of the proximal tubules. Using this full feature set, we investigated whether the data could support machine learning models that capture correlations between PFAS properties and kidney function.
2.2. Data Analysis
Before any machine learning model was built, exploratory data analysis was undertaken to identify the key patterns and summary statistics in the data, since the models depend on the data containing learnable patterns. The analysis also made it possible to detect records that could degrade model accuracy, such as outliers or potential data errors.
The next data analysis technique was dimensionality reduction using Principal Component Analysis (PCA). PCA is well suited to finding the directions along which the data vary the most, which summarize the key correlations among features. Figure 1 shows how the high-dimensional data are projected onto a lower-dimensional space, making patterns easier to visualize, and reports the explained variance ratio of the principal components. This reveals the primary sources of variation in the data and how much of the total variance the first few components capture. The projection also helps reveal clusters, subgroups, and relationships that are not apparent in the original feature space.
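As a minimal sketch of what the explained variance ratio measures, the example below computes it for a two-feature toy dataset using only the Python standard library. The function name and the toy values are illustrative assumptions, not the study's data or code; a real analysis over many descriptors would normally rely on a library PCA implementation.

```python
import math

def explained_variance_ratio_2d(rows):
    """Return the PCA explained variance ratios for two-feature data."""
    n = len(rows)
    mean_x = sum(r[0] for r in rows) / n
    mean_y = sum(r[1] for r in rows) / n
    # Sample covariance matrix entries [[a, b], [b, c]].
    a = sum((r[0] - mean_x) ** 2 for r in rows) / (n - 1)
    c = sum((r[1] - mean_y) ** 2 for r in rows) / (n - 1)
    b = sum((r[0] - mean_x) * (r[1] - mean_y) for r in rows) / (n - 1)
    # Eigenvalues of a symmetric 2x2 matrix in closed form.
    half_trace = (a + c) / 2
    radius = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    eigenvalues = [half_trace + radius, half_trace - radius]
    total = sum(eigenvalues)
    # Each ratio is the share of total variance carried by one component.
    return [ev / total for ev in eigenvalues]

# Hypothetical, already-centered descriptor pairs (illustrative only).
toy_rows = [(-2.0, -1.9), (-1.0, -1.1), (0.0, 0.1), (1.0, 0.9), (2.0, 2.0)]
ratios = explained_variance_ratio_2d(toy_rows)
print(ratios)
```

Because the two toy features are almost perfectly correlated, nearly all of the variance falls on the first component, which is the situation in which a low-dimensional projection loses little information.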
Furthermore, applying clustering on top of PCA gives a better understanding of the patterns in the dataset [25]. PCA simplifies the data while retaining its essential characteristics, so clustering the projected points reveals the natural groupings of those characteristics [26]. K-Means clusters the points based on their similarity in the principal component space. As seen in Figure 2, some groups of points lie very close together in the principal component space, while in the middle of the plot multiple clusters are near one another.
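The clustering step can be illustrated with a small, self-contained version of K-Means (Lloyd's algorithm) applied to two-dimensional points, standing in for samples already projected onto the first two principal components. This is a sketch under stated assumptions: the points are invented, and the centroids are seeded with the first k points for reproducibility, which is not necessarily how the study initialized its clusters.

```python
def kmeans(points, k, n_iter=20):
    """Lloyd's algorithm on 2-D points; the first k points seed the centroids."""
    centroids = [points[i] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(
                range(k),
                key=lambda j: (p[0] - centroids[j][0]) ** 2
                + (p[1] - centroids[j][1]) ** 2,
            )
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = (
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                )
    return labels, centroids

# Hypothetical PCA-projected samples (PC1, PC2); illustrative values only.
projected = [(-2.0, -1.0), (2.1, 1.0), (-2.2, -0.8),
             (1.9, 1.2), (-1.8, -1.1), (2.0, 0.9)]
labels, centroids = kmeans(projected, k=2)
print(labels, centroids)
```

When clusters overlap, as described for the middle of the plot, the assignment of borderline points depends on the initialization, so it is common to repeat the procedure with several seeds.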
2.3. Kidney Type Prediction
Utilizing our data, we attempted first to establish and confirm a correlation between PFAS and kidneys, which we would later conduct the rest of our analysis on. This correlation can describe if there is a relationship between the information about the PFAS and precisely how much it affects the kidney while providing the actual characteristics of the organ. To do this, we created a classifier that would classify what type of kidney was being affected by the PFAS inside of the species, which could allow for early measures and precautions to be taken based on the predicted effects of PFAS on the organ.
2.3.1. Feature Selection
All columns describing a property or behavior of the PFAS chemical were used as the input features (X), while the kidney type alone was used as the target variable (Y).
2.3.2. Model Selection & Hyperparameter Tuning
An XGBoost Classifier (XGBC) was used to classify the PFAS data into the two kidney types in the dataset. XGBC builds multiple weak decision trees sequentially, with each new tree correcting the errors of the previous ones, and combines them into a single, more robust predictor [27]. The idea behind this ensemble approach is that each tree captures part of the structure in the data, so the combined model can account for the many dimensions of the dataset. Additionally, the algorithm has built-in regularization, which was important because the dataset was large and the model was prone to overfitting, that is, to learning patterns in the training data that do not carry over to unseen data.
Furthermore, XGBC allows us to look into the specific features with the highest correlations, especially within the context of the machine learning algorithm’s trained understanding of the patterns within the data [
28]. Utilizing the XGBC feature importances, we can see which columns are the most important for the training of the algorithm and which columns may potentially be negatively affecting the understanding of the data, which would allow us to go back to the feature selection step and ensure that they are removed from the training dataset. Additionally, we can see which features have the highest correlation with the target output that we are looking for using this feature selection to understand better the patterns within the data, which otherwise would not be understandable.
After the model had been trained and the feature importances had been inspected, hyperparameter tuning was used to optimize the model. This is particularly relevant for XGBC, which has many configurable components, including the loss function. Grid search with cross-validation was used to tune the hyperparameters so that the classifier could achieve the best possible accuracy.
2.3.3. Model Training & Evaluation
The data were divided using a 70%-30% train-test split: the model was trained on 70% of the data, and the remaining 30% was held out to evaluate the model and detect overfitting. Two metrics were computed on the test data: a confusion matrix, to show how the predictions were distributed across the true classes, and an accuracy score, to measure the proportion of predictions that matched the actual labels.
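The evaluation procedure described here can be sketched with the standard library alone. The helper names, the fixed random seed, and the class labels `type_a` and `type_b` are assumptions made for illustration; the source does not name the two kidney types or specify how the split was randomized.

```python
import random

def train_test_split_70_30(rows, seed=0):
    """Shuffle a copy of the rows and return (train, test) in a 70/30 split."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(0.7 * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

def confusion_matrix(y_true, y_pred, labels):
    """Rows index the true class, columns index the predicted class."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Ten hypothetical sample identifiers, split 7 / 3.
train_ids, test_ids = train_test_split_70_30(range(10))

# Hypothetical true and predicted kidney types for five held-out samples.
y_true = ["type_a", "type_a", "type_b", "type_b", "type_b"]
y_pred = ["type_a", "type_b", "type_b", "type_b", "type_a"]
cm = confusion_matrix(y_true, y_pred, labels=["type_a", "type_b"])
acc = accuracy(y_true, y_pred)
print(cm, acc)
```

The diagonal of the confusion matrix counts correct predictions per class, so accuracy equals the sum of the diagonal divided by the number of test samples.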
2.4. PFAS Accumulation Model
After establishing if there is a correlation between the kidney type and the information about PFAS, it is vital to see the amount of PFAS that is going to accumulate in the kidney based on the PFAS descriptors. We must see this information because it is critical to assessing the impact of PFAS on the kidney. Additionally, it is essential to understand if there is a correlation between the chemical descriptors of PFAS and the actual effect of PFAS when inside the human body. As a result, a machine learning model was created to make these assessments and understand if we can predict the accumulation of PFAS inside the human body.
2.4.1. Feature Selection
A fundamental difference between this model and the kidney type prediction model is that this one outputs a numerical rather than a categorical response. The dataset contains many features, some of which are categorical, so these had to be encoded as numerical values. After missing values were imputed, all of the PFAS descriptors were used as the X data, while the single Y variable was the amount of PFAS accumulating inside the body.
2.4.2. Model Selection & Hyperparameter Tuning
Similarly to the Kidney Type Prediction model, an XGBoost Regression algorithm was used to predict the amount of PFAS that would accumulate in the body. XGBoost regression was used for this algorithm due to its innate ability to understand nonlinear data through its use of gradient-boosted decision trees [
29]. Using these trees, the algorithm can perform an ensemble method, reducing overfitting inside the algorithm. Additionally, since it is a gradient-boosting algorithm, it can optimize its performance through an iterative process where it continuously improves itself [
30]. Also, it can understand larger datasets with many features, delineate which ones are the most essential parts for it to learn, and then create the most optimal outputs after it has achieved the response it is attempting to get [
31].
2.4.3. Model Training & Evaluation
The model was trained on a 70%-30% train-test split. The held-out test set made it possible to evaluate whether the model had learned generalizable patterns rather than memorizing the training data. On the test set, R² and MAE were calculated to quantify how closely the model’s predictions matched the observed values.
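For reference, the two regression metrics can be written out directly: MAE, and R-squared, which the introduction lists among the evaluation metrics. The sketch below uses invented numbers rather than the study's predictions.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between observed and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical observed and predicted accumulation values (arbitrary units).
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 2.5, 4.0]
mae = mean_absolute_error(observed, predicted)
r2 = r_squared(observed, predicted)
print(mae, r2)
```

MAE is expressed in the units of the target, while R-squared is unitless and compares the model against a baseline that always predicts the mean, so the two are best read together.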
2.5. Glomerular Total Surface Area vs Proximal Tubule Predictor
Following the prediction of PFAS accumulation in kidneys, we aimed to discern the effects of PFAS on kidney function. To do this, we sought to estimate the ratio of Glomerular Total Surface Area (GlomTotSA) to the Volume of the Proximal Tubule (ProxTubTotVol) within the kidneys, which provides insight into the structural dynamics of the organ. A higher ratio may indicate efficient filtration and reabsorption processes, suggesting healthier kidney function. In comparison, a lower ratio might suggest potential alterations in kidney morphology and function, providing early indicators of kidney health issues.
2.5.1. Feature Selection
The feature selection process involved carefully examining the relevance and significance of each feature in the dataset. Using Recursive Feature Elimination (RFE) and correlation analysis, we identified the most informative attributes for our algorithm. Furthermore, domain expertise was crucial in selecting pertinent features, ensuring that the model captured the essence of PFAS accumulation within the kidneys.
Feature engineering, however, was not confined to feature selection alone. It extended to creating engineered features, such as interaction terms between PFAS accumulation descriptors and anatomical variables, enabling the model to capture nuanced relationships within the data. These engineered features served as the basis for the algorithm to make precise predictions regarding the GlomTotSA/ProxTubTotVol ratio.
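A minimal sketch of the two constructions mentioned here, the interaction terms and the ratio used as the target, is given below. The column names `lipophilicity` and `body_mass` and all numeric values are hypothetical placeholders; the actual engineered features are not specified in the text.

```python
def add_interaction_terms(row, pfas_keys, anatomy_keys):
    """Return a copy of `row` extended with pairwise PFAS x anatomy products."""
    extended = dict(row)
    for p in pfas_keys:
        for a in anatomy_keys:
            extended[f"{p}_x_{a}"] = row[p] * row[a]
    return extended

def glom_to_prox_ratio(glom_tot_sa, prox_tub_tot_vol):
    """Target variable: GlomTotSA divided by ProxTubTotVol."""
    return glom_tot_sa / prox_tub_tot_vol

# Hypothetical record with one PFAS descriptor and one anatomical variable.
row = {"lipophilicity": 2.0, "body_mass": 0.25}
extended = add_interaction_terms(row, ["lipophilicity"], ["body_mass"])
ratio = glom_to_prox_ratio(12.0, 4.0)
print(extended, ratio)
```

Tree ensembles can learn interactions on their own, so explicit product terms mainly help when an interaction is expected from domain knowledge and the dataset is too small for the trees to discover it reliably.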
2.5.2. Model Selection & Hyperparameter Tuning
The choice of a Random Forest Regressor was deliberate, given its aptitude for handling complex datasets and capturing both linear and nonlinear relationships within the data. Our dataset encompassed a multitude of variables, including PFAS exposure descriptors, anatomical features, and the target variable, the GlomTotSA/ProxTubTotVol ratio. These variables interact in intricate ways that may not be linear or straightforward. The ensemble nature of Random Forest, comprising multiple decision trees, allows the algorithm to capture these complex relationships effectively. Each decision tree contributes its own perspective, and the ensemble aggregates these to yield robust and accurate predictions, which is particularly advantageous for biological data, where interactions are often intricate and nonlinear [
32]. Another compelling aspect of the Random Forest Regressor is its innate ability to provide insights into feature importance [
33]. Understanding which features are most influential in predicting the GlomTotSA/ProxTubTotVol ratio is essential for gaining mechanistic insights into the relationship between PFAS exposure and kidney morphology [
34]. Random Forest calculates feature importance scores, enabling us to identify which PFAS descriptors and anatomical variables are pivotal in predicting the target variable [
35].
Hyperparameter tuning, a pivotal aspect of model development, involved an exhaustive search for optimal parameters. We employed grid search with cross-validation to find the parameter configuration that yielded the best predictive accuracy and generalization performance.
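The logic of grid search with k-fold cross-validation can be shown independently of the model being tuned. In the sketch below, a tiny one-dimensional k-nearest-neighbor regressor stands in for the Random Forest purely to keep the example dependency-free, and the data are synthetic; the fold assignment, the grid, and the function names are all assumptions for illustration.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_idx, val_idx) pairs; fold f takes every n_folds-th sample."""
    for f in range(n_folds):
        val = [i for i in range(n_samples) if i % n_folds == f]
        train = [i for i in range(n_samples) if i % n_folds != f]
        yield train, val

def knn_predict(train_x, train_y, x, k):
    """Stand-in model: average the targets of the k nearest training points."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    return sum(train_y[i] for i in nearest) / len(nearest)

def cv_mae(xs, ys, k, n_folds=5):
    """Mean absolute error of the stand-in model, averaged over the folds."""
    fold_errors = []
    for train, val in k_fold_indices(len(xs), n_folds):
        tx = [xs[i] for i in train]
        ty = [ys[i] for i in train]
        errors = [abs(knn_predict(tx, ty, xs[i], k) - ys[i]) for i in val]
        fold_errors.append(sum(errors) / len(errors))
    return sum(fold_errors) / len(fold_errors)

def grid_search(xs, ys, k_grid, n_folds=5):
    """Try every candidate in the grid and keep the lowest cross-validated MAE."""
    scores = {k: cv_mae(xs, ys, k, n_folds) for k in k_grid}
    best_k = min(scores, key=scores.get)
    return best_k, scores

# Synthetic data: the target is exactly twice the single feature.
xs = [float(i) for i in range(20)]
ys = [2.0 * x for x in xs]
best_k, scores = grid_search(xs, ys, k_grid=[1, 2, 15])
print(best_k, scores)
```

Each candidate is scored only on folds it was not fitted on, so the selected setting is the one that generalizes best within the training data rather than the one that fits it most closely.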
2.5.3. Model Training & Evaluation
The Random Forest Regressor was trained using a 70/30 train-test split, allowing the model to learn, from the selected features and engineered descriptors, the intricate relationships between PFAS accumulation and the GlomTotSA/ProxTubTotVol ratio.
To gauge the model’s performance, we employed several evaluation metrics, including Mean Absolute Error (MAE) and R², which quantified the model’s predictive accuracy. Additionally, we assessed the model’s ability to generalize to unseen data through k-fold cross-validation, ensuring robustness and mitigating overfitting concerns.
4. Biological Significance
As seen by the accuracy of the Kidney Type Prediction Model, our results agree with previous studies and indicate a correlation between PFAS descriptors and kidney type, on which the rest of our analysis is based. The ability of PFAS descriptors to predict kidney type with perfect accuracy suggests that PFAS compounds are not passive bystanders in the body but actively influence kidney morphology. As mentioned, PFAS, once absorbed into the bloodstream, can infiltrate renal tissues, where they interact with cellular components and trigger molecular responses [
36]. This interaction may lead to structural changes in the kidneys, potentially altering the distribution of kidney types [
37]. The fact that PFAS descriptors alone can reliably distinguish between different kidney types underscores the pronounced and biologically meaningful impact of PFAS exposure on renal structures.
Furthermore, the model provides significance in early diagnoses and organ assessment by predicting the amount of PFAS accumulation in the kidneys utilizing an XGBoost Regressor. Understanding how PFAS chemicals accumulate in vital organs like the kidneys is crucial for assessing the long-term health impacts of PFAS exposure. By employing XGBoost, researchers gain a powerful tool for modeling and predicting these accumulations, allowing for more accurate risk assessments and informed policy decisions. This predictive capability can guide regulatory bodies and healthcare providers in designing effective strategies to mitigate PFAS exposure and protect public health. By forecasting PFAS accumulation, we move beyond reacting to environmental contamination and instead proactively manage and mitigate its impacts.
The estimation of the ratio of Glomerular Total Surface Area (GlomTotSA) to the Volume of the Proximal Tubule (ProxTubTotVol) also offers significant scientific insight into the health and performance of our kidneys. The GlomTotSA-to-ProxTubTotVol ratio holds the key to understanding the structural efficiency of the kidneys. The glomerular surface area represents the vast expanse of tiny filtering units within the kidney, where blood is meticulously sieved to remove waste and excess substances [
38]. The volume of the proximal tubule reflects the space available for the reabsorption of essential substances into the bloodstream, a crucial process in maintaining bodily homeostasis [
39]. When this ratio is higher, it suggests that the kidneys are adept at filtration and reabsorption, indicative of a well-functioning organ [
40]. This signifies that the kidneys efficiently clear waste and retain vital compounds, reflecting good renal health. Conversely, a lower GlomTotSA-to-ProxTubTotVol ratio can signify potential kidney morphology and function issues. This may point to structural alterations within the kidneys that affect their ability to filter and reabsorb substances optimally [
41]. On a chemical level, this ratio can also be linked to the renal clearance of various substances, including drugs and toxins [
42]. Understanding these chemical processes is pivotal for predicting how the kidneys will handle different compounds, informing medication dosages and toxicity assessments.
This ratio further provides insight into the complex interplay of filtration, reabsorption, and structural integrity within the kidneys [
43]. It allows researchers to investigate the effects of PFAS exposure on these processes, potentially uncovering early indicators of kidney health issues. Moreover, it has applications beyond PFAS research, as it can be used as a valuable biomarker for assessing kidney function and diagnosing renal diseases. The ability to estimate this ratio with precision, informed by predictive PFAS accumulation models, represents a powerful tool for advancing our understanding of kidney health and clinical interventions and treatments.
The GlomTotSA-to-ProxTubTotVol ratio also offers a useful perspective on the relationship between kidney structure and function, particularly in the context of the estimated glomerular filtration rate (eGFR) [
44]. The eGFR is a critical indicator of kidney function, primarily based on factors like creatinine levels, age, sex, and race, ultimately reflecting how efficiently the kidneys filter waste products from the bloodstream [
45]. The GlomTotSA-to-ProxTubTotVol ratio and eGFR are interconnected, both biologically and chemically. This relationship is pivotal for understanding renal function, particularly how well the kidneys filter waste and maintain overall homeostasis in the body. The ratio represents the glomerular surface area relative to the proximal tubular volume, whereas eGFR estimates the rate at which the glomeruli filter blood [
46].
Biologically, this relationship hinges on the intricate anatomy and physiology of the renal system. The glomeruli, small tuft-like structures within the kidney, are the primary filtration units. They filter blood and allow water and solutes to enter the tubular system while retaining larger molecules like proteins. The proximal tubules, on the other hand, are involved in reabsorbing valuable substances such as glucose and electrolytes [
42]. The ratio of the glomerular surface area to the proximal tubular volume reflects the balance between filtration and reabsorption in the kidney. A high GlomTotSA-to-ProxTubTotVol ratio suggests efficient filtration relative to reabsorption, typically associated with better renal function [
47]. When the kidneys function optimally, the eGFR is higher because the glomeruli are efficiently filtering a greater blood volume, and waste products are being excreted. Conversely, if the ratio is skewed or compromised due to kidney damage or disease, the eGFR decreases, indicating impaired filtration capacity and potential renal dysfunction [
48].
Chemically, the interaction is also governed by the intricate molecular exchange processes within the kidney. The glomeruli filter blood by using a combination of pressure-driven physical forces and the selective permeability of their basement membranes and podocyte cells [
49]. These structures allow small molecules and water to pass into the renal tubules while preventing larger molecules like proteins from crossing over. The proximal tubules then selectively reabsorb essential molecules and regulate the concentration of electrolytes and waste products in the urine [
50]. The ratio signifies the efficiency of these filtration and reabsorption processes. When the balance between filtration and reabsorption is disrupted, such as in kidney damage, inflammation, or glomerular dysfunction, the GlomTotSA-to-ProxTubTotVol ratio can change [
51]. This imbalance decreases eGFR, as less blood is effectively filtered and more waste products may accumulate in the bloodstream [
52]. In this chemical interplay, disruptions in the GlomTotSA-to-ProxTubTotVol ratio can serve as a valuable indicator of kidney function and provide insights into renal health or dysfunction. This ratio may also signify how effectively the kidneys handle the excretion of various substances. Efficient filtration and reabsorption are vital for maintaining electrolyte balance and eliminating waste, pharmaceuticals, and toxins from the body [
53]. Therefore, if the ratio tilts towards a larger glomerular surface area relative to the proximal tubule volume, it may suggest that the kidneys can process substances more effectively, potentially contributing to higher eGFR values.
5. Conclusion
This study demonstrates machine learning’s potential to elucidate the intricate impacts of PFAS on kidney health. The models presented confirm discernible links between PFAS and renal function. The kidney type classifier verifies that PFAS alters morphology. The PFAS accumulation regressor enables clinical monitoring to protect at-risk groups. Estimating the GlomTotSA/ProxTubTotVol ratio provides significant insights into PFAS’s effects on filtration and reabsorption efficiency.
The models showcase PFAS’s multifaceted effects on kidney structure and function. The techniques pave the way for enhanced risk assessment, improved clinical surveillance, and targeted therapeutics. This study underscores the power of machine learning to unravel PFAS toxicity mechanisms. It makes significant contributions to the current understanding of PFAS nephrotoxicity.
The high predictive accuracy, feature importance analyses, and model interpretability reveal concrete biological impacts of PFAS on kidneys. PFAS descriptors actively influence morphology and accumulation. The ratio estimation sheds light on filtration and reabsorption dynamics. These findings confirm PFAS’s pronounced effects on renal physiology.
This research demonstrates machine learning’s immense potential for elucidating PFAS nephrotoxicity. The models provide actionable clinical insights and expand mechanistic knowledge of PFAS-kidney interplay. This paves the way for more informed risk analysis, earlier diagnosis of PFAS-associated kidney damage, and targeted therapies. The study sets a strong foundation for future machine-learning investigations into PFAS toxicity.