Submitted: 11 March 2024
Posted: 11 March 2024
Abstract
Keywords:
1. Introduction
Overview of Machine Learning Models
Machine Learning Algorithms

Linear Regression (LR)
Logistic Regression (LogReg)
K-Nearest Neighbor (KNN)
Support Vector Machine (SVM)
Decision Trees (DT) and Random Forest (RF)
Graph Convolutional Networks (GCN)
Gradient Boosting (GB)
Nonnegative Matrix Factorization + k-Means Clustering (NMFk)
Clustering
Principal Component Analysis (PCA)
Semi-Supervised Learning
Neural Networks
Deep Neural Networks (DNN)

Machine Learning Uses in PFAS
Data Source
| Data Source | Type of water | Location | References |
| GAMA | Groundwater | California | (Dong et al., 2023; George and Dixit, 2021) |
| Environmental Working Group | | | (Dong et al., 2023) |
| National Cooperative Soil Survey (NCSS) | | | (Dong et al., 2023) |
| National Oceanic and Atmospheric Administration | | | (Dong et al., 2023) |
| National Aeronautics and Space Administration (NASA) | | | (Dong et al., 2023) |
| OECD | All | Paris | (Kwon et al., 2023; Su et al., 2023) |
| USEPA | | United States | (Azhagiya Singam et al., 2020; DeLuca et al., 2023; Dong et al., 2023) |
| PubChem Bioassay | All | | (Cheng and Ng, 2019; Kwon et al., 2023) |
| Government agencies | Surface water | Pennsylvania | (Breitmeyer et al., 2023) |
| | Drinking water, Groundwater | Michigan | (Fernandez et al., 2023) |
| | Drinking water | China | (Yuan et al., 2023) |
| | Groundwater | Minnesota | (Li and Gibson, 2023) |
| Lake and River data | Lake and River | Columbia River Basin | (DeLuca et al., 2023) |
| | | Norway | (Stults et al., 2023) |
| Experimental data | Wastewater treatment plant | China and Africa | (Jiang et al., 2023) |
| | | | (Cao et al., 2022; Sörengård et al., 2022; Wang et al., 2022) |
| | Aquifer | Eastern United States | (McMahon et al., 2022) |
| | Groundwater | Jiangxi, China | (Wang et al., 2022) |
| | Private wells | New Hampshire | (Hu et al., 2021) |
| | Marine | Hong Kong | (Liu et al., 2023) |
| | Surface water | Chaobai River | (Hu et al., 2023) |
| | Aqueous film-forming foam impacted groundwater, Leachate, WWTP | Pulp and paper power generation industries, United States | (Joseph et al., 2023) |
| | Drinking water, Uppsala groundwater aquifer | Sweden | (Sörengård et al., 2022) |
| Plot Digitizer | Contaminated water | | (Hosseinzadeh et al., 2022) |
| Web of Knowledge | | | (Han et al., 2023) |
| Data from around the world | | | (Kibbey et al., 2021a, 2021b, 2020) |
| Previously published data | | | (Patel et al., 2022) |
| | | | (Karbassiyazdi et al., 2022; Kibbey et al., 2020; Patel et al., 2022) |
| | | | (Cao et al., 2022) |
Implementation Details of the Methods
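Several of the studies summarized in this section report an 80:20 train-test split combined with stratified k-fold cross-validation. The following is a minimal sketch of stratified fold assignment in plain Python, not the implementation used by any of the cited studies. The function name `stratified_kfold` and the toy labels are hypothetical and chosen only for illustration.

```python
import random
from collections import defaultdict


def stratified_kfold(labels, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs that roughly preserve class proportions."""
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)

    # Shuffle within each class, then deal indices round-robin into the folds
    # so that every fold receives a near-equal share of each class.
    folds = [[] for _ in range(k)]
    for idx in by_class.values():
        rng.shuffle(idx)
        for j, i in enumerate(idx):
            folds[j % k].append(i)

    # Each fold serves once as the held-out set; the rest form the training set.
    for f in range(k):
        test = sorted(folds[f])
        train = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train, test


# Hypothetical imbalanced labels: 1 = PFAS detected, 0 = not detected.
labels = [1] * 20 + [0] * 80
splits = list(stratified_kfold(labels, k=5))
for train, test in splits:
    positives = sum(labels[i] for i in test)
    print(len(train), len(test), positives)
```

With k = 5, each held-out fold is 20% of the data, so a single fold also corresponds to an 80:20 split. Stratification matters most for imbalanced detection data, where a purely random split can leave a fold with very few positive samples.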
Model Evaluation Metrics
| Implementation details | Evaluation metrics | Model Performance | Reference |
| Number of estimators = 1000; 10-fold CV with 500 iterations on the training set for hyperparameter tuning | Area under the curve (AUC) | RF outperformed linear models for all feature subsets except the number of nearby airports | (George and Dixit, 2021) |
| Train: Validation = 70:30, Grid search and Gaussian process techniques were used for hyperparameter tuning. Estimator number=250 | AUC | The weave model, Graph Convolutional model, Pyramidal Multitask network, and 1-hidden layer Multitask network outperformed RF for both CF and C3F6 datasets | (Cheng and Ng, 2019) |
| Train:CV = 80:20. Iterations=100. Number of trees = 1000 | MAE, RMSE, ME, AUC, Accuracy, Sensitivity and Specificity | The best-performing classification model used the 5 ng/g threshold concentration, while the 1.5 ng/g threshold had the worst performance. | (DeLuca et al., 2023) |
| SMOTE and ADASYN were used to balance data. Train: test split=80:20. Training set was further split into model training: hyperparameter tuning = 80:20. Stratified CV was used for total PFAS prediction. Grid search optimization was used to tune hyperparameters | Accuracy, Precision, Recall, F-Score, and Area Under the Receiver Operating Characteristic curve (AUROC) | Based on ML model baseline performance for total PFAS prediction, RF performed the best. RF>XGB>CatBoost>LightGBM>GaussianNB>LogReg>SVM. | (Dong et al., 2023) |
| Train_test split=80:20. 5-fold CV for consistency and reduction of overfit bias. Hyperparameter tuning by Bayesian optimization | MAE, RMSE, R2. | DNN performed better than RF in accuracy (DNN>GCN>GP>RF) | (Feinstein et al., 2021) |
| Highly collinear predictors were removed before fitting. TP, TN, FP, and FN were calculated from a confusion matrix under stratified 10-fold CV. | AUROC | AUROC for RF>LogReg for all PFAS modeled and for the detection of any of the five PFAS. Classification RF performed well in identifying locations likely to have detectable PFAS concentrations in private wells. | (Hu et al., 2021) |
| Train_test split=4:1. Number of trees=500. A 5-fold CV was applied to evaluate fitness. MAE was used for hyperparameter tuning of CV. Number of iterations=1500. | R2, MAE, RMSE | GBR performed better than RF. The predictive results for 25 emerging PFAS revealed that most of these compounds, such as PFOS alternatives, were recalcitrant to reductive defluorination, whereas PFECAs had relatively stronger defluorination abilities than PFPiA or diPAP. | (Cao et al., 2022) |
| Train: test split = 80:20. 5-fold CV | R2, RMSE, and Prediction error | The RF model performed well with 2D autocorrelation descriptors as the most critical features | (Hu et al., 2023) |
| Training:Validation:Test=70:20:10. Grid search + stratified 10-fold CV was used to tune hyperparameters and check for fitting. Estimators=10, 100, 1000. RFE was used to overcome the curse of dimensionality. | Balanced accuracy (BA) | In Case 1, RF outperformed SVC and LR on testing set BA (67%), while SVC performed best for both the training set BA (83%) and validation set BA (50%). In Cases 2 and 3, model performance did not vary much. | (Joseph et al., 2023) |
| Train-test split = 75:25. The 75% training set was further split to test and tune the model using stratified 10-fold CV | ROC-AUC | RF achieved the highest accuracy for PFHpA and PFOS (>98%) and the lowest for total PFAS (>90%). | (Fernandez et al., 2023) |
| Train-test split=80:20. A 5-fold CV was applied to prevent overfitting and data wastage. Grid search was used for hyperparameter tuning. | MSE, R2 and MAE | GBM model performed better than AdaBoost and RF based on error and correlation indices. The RF model can reliably predict the PFOS rejection rate during the nanofiltration process. | (Hosseinzadeh et al., 2022) |
| Train_test split=80:20: 80% of the data was used to train the model and the remaining 20% to test it. Hyperparameters were selected by initial validation using a subset of the training data. Variants made use of 1000 individual decision trees. | | | (Kibbey et al., 2021a, 2021b; McMahon et al., 2022) |
| Train_test split=80:20. Grid search CV was used for hyperparameter tuning with a 10-fold CV on the training set. | AUC, Sensitivity, Specificity, Accuracy, Matthews correlation coefficient (MCC) | SVM performed better than RF, LogReg, KNN, and AdaBoost, suggesting that it is a suitable ML method for NR binding chemicals. | (Azhagiya Singam et al., 2020) |
| Train_test split=80:20. Optimization of hyperparameters was by Grid search and 5-fold CV for resampling. | MRE, MAE, RMSE, R2 | RF models performed best among 15 combinations | (Mu et al., 2024) |
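The evaluation metrics named in the table above follow standard definitions. The sketch below computes the most frequently reported ones in plain Python, assuming binary labels coded as 1 (positive) and 0 (negative). All function names and toy values are hypothetical and are not taken from any cited study.

```python
import math


def _ratio(num, den):
    # Return 0.0 instead of raising when a denominator is empty.
    return num / den if den else 0.0


def classification_metrics(y_true, y_pred):
    """Metrics derived from the confusion matrix (TP, TN, FP, FN)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    sensitivity = _ratio(tp, tp + fn)  # also called recall
    specificity = _ratio(tn, tn + fp)
    precision = _ratio(tp, tp + fp)
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": _ratio(tp + tn, len(pairs)),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "precision": precision,
        "f1": _ratio(2 * precision * sensitivity, precision + sensitivity),
        "mcc": _ratio(tp * tn - fp * fn, mcc_den),
    }


def roc_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def regression_metrics(y_true, y_pred):
    """MAE, RMSE, and the coefficient of determination (R2)."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return {
        "mae": sum(abs(e) for e in errors) / n,
        "rmse": math.sqrt(ss_res / n),
        "r2": 1 - ss_res / ss_tot,
    }


# Hypothetical toy data: predicted labels are the scores thresholded at 0.5.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2, 0.1, 0.35, 0.05, 0.6]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

cls = classification_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
reg = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 2.5, 4.0])
print(cls, auc, reg)
```

Threshold-based metrics such as accuracy and sensitivity depend on the chosen cutoff, whereas AUC summarizes ranking quality across all cutoffs. That difference is one reason results reported with different metrics in the table are not directly comparable.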
Use Cases of ML in PFAS
Source and Occurrence
Behavior and Pattern
Classification and Grouping
Removal Efficiency
Contamination
Model Performance
Conclusions
References
- Adu, O., Ma, X., Sharma, V.K., 2023. Bioavailability, phytotoxicity and plant uptake of per-and polyfluoroalkyl substances (PFAS): A review. Journal of Hazardous Materials 447, 130805. [CrossRef]
- Antell, E.H., Yi, S., Olivares, C.I., Ruyle, B.J., Kim, J.T., Tsou, K., Dixit, F., Alvarez-Cohen, L., Sedlak, D.L., 2023. The Total Oxidizable Precursor (TOP) Assay as a Forensic Tool for Per- and Polyfluoroalkyl Substances (PFAS) Source Apportionment. ACS EST Water acsestwater.3c00106. [CrossRef]
- Almousa, M., Olusegun, T.S., Lim, Y.H., Khraisat, I., Ajao, A., 2023. Groundwater Management Strategies for Handling Produced Water Generated Prior Injection Operations in the Bakken Oilfield, in: ARMA/DGS/SEG International Geomechanics Symposium. ARMA, pp. ARMA-IGS.
- Almousa, M., 2023. Characterization and treatment of Bakken oilfield produced water. https://commons.und.edu/grad-posters/5/.
- Almousa, M., Tomomewo, O.S., Lim, Y.H., 2023. Salts Removal as an Effective and Economical Method of Bakken Formation Treatment.
- Ayati, A.H., Haghighi, A., Ghafouri, H.R., 2022. Machine Learning–Assisted Model for Leak Detection in Water Distribution Networks Using Hydraulic Transient Flows. J. Water Resour. Plann. Manage. 148, 04021104. [CrossRef]
- Azhagiya Singam, E.R., Tachachartvanich, P., Fourches, D., Soshilov, A., Hsieh, J.C.Y., La Merrill, M.A., Smith, M.T., Durkin, K.A., 2020. Structure-based virtual screening of perfluoroalkyl and poly-fluoroalkyl substances (PFASs) as endocrine disruptors of androgen receptor activity using molecular docking and machine learning. Environmental Research 190, 109920. [CrossRef]
- Banerjee, K., Bali, V., Nawaz, N., Bali, S., Mathur, S., Mishra, R.K., Rani, S., 2022. A Machine-Learning Approach for Prediction of Water Contamination Using Latitude, Longitude, and Elevation. Water 14, 728. [CrossRef]
- Breitmeyer, S.E., Williams, A.M., Duris, J.W., Eicholtz, L.W., Shull, D.R., Wertz, T.A., Woodward, E.E., 2023. Per- and polyfluorinated alkyl substances (PFAS) in Pennsylvania surface waters: A statewide assessment, associated sources, and land-use relations. Science of The Total Environment 888, 164161. [CrossRef]
- Brusseau, M.L., Guo, B., Huang, D., Yan, N., Lyu, Y., 2021. Ideal versus Nonideal Transport of PFAS in Unsaturated Porous Media. Water Research 202, 117405. [CrossRef]
- Cao, H., Peng, J., Zhou, Z., Sun, Y., Wang, Y., Liang, Y., 2022. Insight into the defluorination ability of per- and poly-fluoroalkyl substances based on machine learning and quantum chemical computations. Science of The Total Environment 807, 151018. [CrossRef]
- Cao, H., Peng, J., Zhou, Z., Yang, Z., Wang, L., Sun, Y., Wang, Y., Liang, Y., 2023. Investigation of the Binding Fraction of PFAS in Human Plasma and Underlying Mechanisms Based on Machine Learning and Molecular Dynamics Simulation. Environ. Sci. Technol. 57, 17762–17773. [CrossRef]
- Charbonnet, J.A., Rodowa, A.E., Joseph, N.T., Guelfo, J.L., Field, J.A., Jones, G.D., Higgins, C.P., Helbling, D.E., Houtz, E.F., 2021. Environmental Source Tracking of Per- and Polyfluoroalkyl Substances within a Forensic Context: Current and Future Techniques. Environ. Sci. Technol. 55, 7237–7245. [CrossRef]
- Cheng, W., Ng, C.A., 2019. Using Machine Learning to Classify Bioactivity for 3486 Per- and Polyfluoroalkyl Substances (PFASs) from the OECD List. Environ. Sci. Technol. 53, 13970–13980. [CrossRef]
- DeLuca, N.M., Mullikin, A., Brumm, P., Rappold, A.G., Cohen Hubal, E., 2023. Using Geospatial Data and Random Forest To Predict PFAS Contamination in Fish Tissue in the Columbia River Basin, United States. Environ. Sci. Technol. 57, 14024–14035. [CrossRef]
- Díaz-Galiano, F.J., Murcia-Morales, M., Monteau, F., Le Bizec, B., Dervilly, G., 2023. Collision cross-section as a universal molecular descriptor in the analysis of PFAS and use of ion mobility spectrum filtering for improved analytical sensitivities. Analytica Chimica Acta 1251, 341026. [CrossRef]
- Dong, J., Tsai, G., Olivares, C.I., 2023. Prediction of 35 Target Per- and Polyfluoroalkyl Substances (PFASs) in California Groundwater Using Multilabel Semisupervised Machine Learning. ACS EST Water acsestwater.3c00134. [CrossRef]
- Feinstein, J., Sivaraman, G., Picel, K., Peters, B., Vázquez-Mayagoitia, Á., Ramanathan, A., MacDonell, M., Foster, I., Yan, E., 2021. Uncertainty-Informed Deep Transfer Learning of Perfluoroalkyl and Polyfluoroalkyl Substance Toxicity. J. Chem. Inf. Model. 61, 5793–5803. [CrossRef]
- Fernandez, N., Nejadhashemi, A.P., Loveall, C., 2023. Large-scale assessment of PFAS compounds in drinking water sources using machine learning. Water Research 243, 120307. [CrossRef]
- García, J., Leiva-Araos, A., Diaz-Saavedra, E., Moraga, P., Pinto, H., Yepes, V., 2023. Relevance of Machine Learning Techniques in Water Infrastructure Integrity and Quality: A Review Powered by Natural Language Processing. Applied Sciences 13, 12497. [CrossRef]
- George, S., Dixit, A., 2021. A machine learning approach for prioritizing groundwater testing for per- and polyfluoroalkyl substances (PFAS). Journal of Environmental Management 295, 113359. [CrossRef]
- Guo, B., Zeng, J., Brusseau, M.L., Zhang, Y., 2022. A screening model for quantifying PFAS leaching in the vadose zone and mass discharge to groundwater. Advances in Water Resources 160, 104102. [CrossRef]
- Han, B.-C., Liu, J.-S., Bizimana, A., Zhang, B.-X., Kateryna, S., Zhao, Z., Yu, L.-P., Shen, Z.-Z., Meng, X.-Z., 2023. Identifying priority PBT-like compounds from emerging PFAS by nontargeted analysis and machine learning models. Environmental Pollution 338, 122663. [CrossRef]
- Hosseinzadeh, A., Zhou, J.L., Zyaie, J., AlZainati, N., Ibrar, I., Altaee, A., 2022. Machine learning-based modeling and analysis of PFOS removal from contaminated water by nanofiltration process. Separation and Purification Technology 289, 120775. [CrossRef]
- Hu, J., Lyu, Y., Chen, H., Cai, L., Li, J., Cao, X., Sun, W., 2023. Integration of target, suspect, and nontarget screening with risk modeling for per- and poly-fluoroalkyl substances prioritization in surface waters. Water Research 233, 119735. [CrossRef]
- Hu, X.C., Dai, M., Sun, J.M., Sunderland, E.M., 2022. The Utility of Machine Learning Models for Predicting Chemical Contaminants in Drinking Water: Promise, Challenges, and Opportunities. Curr Envir Health Rpt 10, 45–60. [CrossRef]
- Hu, X.C., Ge, B., Ruyle, B.J., Sun, J., Sunderland, E.M., 2021. A Statistical Approach for Identifying Private Wells Susceptible to Perfluoroalkyl Substances (PFAS) Contamination. Environ. Sci. Technol. Lett. 8, 596–602. [CrossRef]
- Jiang, L., Yao, J., Ren, G., Sheng, N., Guo, Y., Dai, J., Pan, Y., 2023. Comprehensive profiles of per- and poly-fluoroalkyl substances in Chinese and African municipal wastewater treatment plants: New implications for removal efficiency. Science of The Total Environment 857, 159638. [CrossRef]
- Jiang, Z., Hu, J., Tong, M., Samia, A.C., Zhang, H. (Judy), Yu, X. (Bill), 2021. A Novel Machine Learning Model to Predict the Photo-Degradation Performance of Different Photocatalysts on a Variety of Water Contaminants. Catalysts 11, 1107. [CrossRef]
- Joseph, N.T., Schwichtenberg, T., Cao, D., Jones, G.D., Rodowa, A.E., Barlaz, M.A., Charbonnet, J.A., Higgins, C.P., Field, J.A., Helbling, D.E., 2023. Target and Suspect Screening Integrated with Machine Learning to Discover Per- and Polyfluoroalkyl Substance Source Fingerprints. Environ. Sci. Technol. 57, 14351–14362. [CrossRef]
- Karbassiyazdi, E., Fattahi, F., Yousefi, N., Tahmassebi, A., Taromi, A.A., Manzari, J.Z., Gandomi, A.H., Altaee, A., Razmjou, A., 2022. XGBoost model as an efficient machine learning approach for PFAS removal: Effects of material characteristics and operation conditions. Environmental Research 215, 114286. [CrossRef]
- Kibbey, T.C.G., Jabrzemski, R., O’Carroll, D.M., 2021a. Predicting the relationship between PFAS component signatures in water and non-water phases through mathematical transformation: Application to machine learning classification. Chemosphere 282, 131097. [CrossRef]
- Kibbey, T.C.G., Jabrzemski, R., O’Carroll, D.M., 2021b. Source allocation of per- and poly-fluoroalkyl substances (PFAS) with supervised machine learning: Classification performance and the role of feature selection in an expanded dataset. Chemosphere 275, 130124. [CrossRef]
- Kibbey, T.C.G., Jabrzemski, R., O’Carroll, D.M., 2020. Supervised machine learning for source allocation of per- and polyfluoroalkyl substances (PFAS) in environmental samples. Chemosphere 252, 126593. [CrossRef]
- Kwon, H., Ali, Z.A., Wong, B.M., 2023. Harnessing Semi-Supervised Machine Learning to Automatically Predict Bioactivities of Per- and Polyfluoroalkyl Substances (PFASs). Environ. Sci. Technol. Lett. 10, 1017–1022. [CrossRef]
- Le, S.-T., Kibbey, T.C.G., Weber, K.P., Glamore, W.C., O’Carroll, D.M., 2021. A group-contribution model for predicting the physicochemical behavior of PFAS components for understanding environmental fate. Science of The Total Environment 764, 142882. [CrossRef]
- Li, R., Gibson, J.M., 2023. Predicting Groundwater PFOA Exposure Risks with Bayesian Networks: Empirical Impact of Data Preprocessing on Model Performance. Environ. Sci. Technol. 57, 18329–18338. [CrossRef]
- Li, R., MacDonald Gibson, J., 2022. Predicting the occurrence of short-chain PFAS in groundwater using machine-learned Bayesian networks. Front. Environ. Sci. 10, 958784. [CrossRef]
- Liu, Y., Wang, Q., Ma, L., Jin, L., Zhang, K., Tao, D., Wang, W.-X., Lam, P.K.S., Ruan, Y., 2023. Identification of key features relating to the coexistence mechanisms of trace elements and per- and polyfluoroalkyl substances (PFASs) in marine mammals. Environment International 178, 108099. [CrossRef]
- McMahon, P.B., Tokranov, A.K., Bexfield, L.M., Lindsey, B.D., Johnson, T.D., Lombard, M.A., Watson, E., 2022. Perfluoroalkyl and Polyfluoroalkyl Substances in Groundwater Used as a Source of Drinking Water in the Eastern United States. Environ. Sci. Technol. 56, 2279–2288. [CrossRef]
- Mu, H., Yang, Z., Chen, L., Gu, C., Ren, H., Wu, B., 2024. Suspect and nontarget screening of per- and polyfluoroalkyl substances based on ion mobility mass spectrometry and machine learning techniques. Journal of Hazardous Materials 461, 132669. [CrossRef]
- Ordonez, D., Podder, A., Valencia, A., Sadmani, A.H.M.A., Reinhart, D., Chang, N.-B., 2022. Continuous fixed-bed column adsorption of perfluorooctane sulfonic acid (PFOS) and perfluorooctanoic acid (PFOA) from canal water using zero-valent Iron-based filtration media. Separation and Purification Technology 299, 121800. [CrossRef]
- Panigrahi, N., Patro, S.G.K., Kumar, R., Omar, M., Ngan, T.T., Giang, N.L., Thu, B.T., Thang, N.T., 2023. Groundwater Quality Analysis and Drinkability Prediction using Artificial Intelligence. Earth Sci Inform 16, 1701–1725. [CrossRef]
- Patel, H., Park, H., Zhao, R., 2022. Predicting the Partitioning Behavior of Per- and Poly-Alkyl Substances (PFAS) on Liquid-Solid Interface for Carbon and Mineral Based Surfaces using Multivariate Linear Regression Models with K-Fold Cross Validation. (preprint). Chemistry. [CrossRef]
- Ragi, N.M., Holla, R., Manju, G., 2019. Predicting Water Quality Parameters Using Machine Learning, in 2019 4th International Conference on Recent Trends on Electronics, Information, Communication & Technology (RTEICT). Presented at the 2019 4th International Conference on Recent Trends on Electronics, Information, Communication & Technology (RTEICT), IEEE, Bangalore, India, pp. 1109–1112. [CrossRef]
- Raza, A., Bardhan, S., Xu, L., Yamijala, S.S.R.K.C., Lian, C., Kwon, H., Wong, B.M., 2019. A Machine Learning Approach for Predicting Defluorination of Per- and Polyfluoroalkyl Substances (PFAS) for Their Efficient Treatment and Removal. Environ. Sci. Technol. Lett. 6, 624–629. [CrossRef]
- Sörengård, M., Bergström, S., McCleaf, P., Wiberg, K., Ahrens, L., 2022. Long-distance transport of per- and poly-fluoroalkyl substances (PFAS) in a Swedish drinking water aquifer. Environmental Pollution 311, 119981. [CrossRef]
- Sosnowska, A., Bulawska, N., Kowalska, D., Puzyn, T., 2023. Towards higher scientific validity and regulatory acceptance of predictive models for PFAS. Green Chem. 25, 1261–1275. [CrossRef]
- Stults, J.F., Higgins, C.P., Helbling, D.E., 2023. Integration of Per- and Polyfluoroalkyl Substance (PFAS) Fingerprints in Fish with Machine Learning for PFAS Source Tracking in Surface Water. Environ. Sci. Technol. Lett. 10, 1052–1058. [CrossRef]
- Su, A., Cheng, Y., Zhang, C., Yang, Y.-F., She, Y.-B., Rajan, K., 2023. An Artificial Intelligence Platform for Automated PFAS Subgroup Classification: A Discovery Tool for PFAS Screening (preprint). Chemistry. [CrossRef]
- Wang, Q., Song, X., Wei, C., Ding, D., Tang, Z., Tu, X., Chen, X., Wang, S., 2022. Distribution, source identification, and health risk assessment of PFASs in groundwater from Jiangxi Province, China. Chemosphere 291, 132946. [CrossRef]
- Wang, Y., Darling, S.B., Chen, J., 2021. Selectivity of Per- and Polyfluoroalkyl Substance Sensors and Sorbents in Water. ACS Appl. Mater. Interfaces 13, 60789–60814. [CrossRef]
- Xu, Z., Lv, Z., Li, J., Shi, A., 2022. A Novel Approach for Predicting Water Demand with Complex Patterns Based on Ensemble Learning. Water Resour Manage 36, 4293–4312. [CrossRef]
- Yuan, Shideng, Wang, X., Jiang, Z., Zhang, H., Yuan, Shiling, 2023. Contribution of air-water interface in removing PFAS from drinking water: Adsorption, stability, interaction, and machine learning studies. Water Research 236, 119947. [CrossRef]
- Zeng, J., Brusseau, M.L., Guo, B., 2021. Model validation and analyses of parameter sensitivity and uncertainty for modeling long-term retention and leaching of PFAS in the vadose zone. Journal of Hydrology 603, 127172. [CrossRef]

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).