Submitted: 12 May 2026
Posted: 13 May 2026
Abstract
Keywords:
1. Introduction
- A machine learning (ML) framework was developed to predict polymer properties from the polyOne dataset, addressing the nonlinearity of structure-property relationships (SPR).
- Data preprocessing, feature extraction with molecular descriptors and topological indices, and feature selection with Recursive Feature Elimination (RFE) were performed in sequence.
- An XGBoost predictive model was implemented, and its hyperparameters were tuned with the Starfish Optimization Algorithm (SOA) to improve prediction accuracy and efficiency.
- Model interpretability was provided by SHAP for global feature importance and LIME for local explanations, which identified the molecular descriptors that most influence polymer properties.
2. Literature Review
2.1. Problem Statement
3. Proposed Methodology
3.1. Feature Extraction of Polymer Structures
3.1.1. Molecular Descriptors
3.1.2. Topological Indices
3.2. Feature Selection Using Recursive Feature Elimination
3.2.1. Recursive Feature Elimination
| Algorithm 1: Feature Selection using RFE |
|---|
| Input: feature matrix $X$, target vector $y$, desired number of features $k$ |
| Output: selected feature subset $S$ |
| 1. Initialize the feature set $F \leftarrow$ all features in $X$ |
| 2. While the number of features in $F$ exceeds $k$: |
| a. Train model $M$ on $F$ to predict $y$ |
| b. Compute feature importance scores for each feature in $F$ |
| c. Identify the feature $f_{\min}$ with the lowest importance |
| d. Remove $f_{\min}$ from $F$ |
| 3. Return $S = F$ |
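The elimination loop of Algorithm 1 can be sketched in plain Python as follows. This is a minimal illustration, not the implementation used in this work: the function names `rfe` and `abs_pearson_scores` are hypothetical, and the scorer is a stand-in (absolute Pearson correlation with the target) for the importance scores that a trained model would supply in steps 2a and 2b.

```python
from math import sqrt


def abs_pearson_scores(X, y, cols):
    """Hypothetical stand-in scorer: |Pearson r| between each column and y.

    A model-based scorer (e.g. tree feature importances) would be refit on
    the remaining columns here; correlation is used only to keep the sketch
    dependency-free.
    """
    n = len(y)
    y_mean = sum(y) / n
    y_dev = [v - y_mean for v in y]
    y_norm = sqrt(sum(d * d for d in y_dev))
    scores = {}
    for j in cols:
        col = [row[j] for row in X]
        c_mean = sum(col) / n
        c_dev = [v - c_mean for v in col]
        c_norm = sqrt(sum(d * d for d in c_dev))
        denom = c_norm * y_norm
        cov = sum(a * b for a, b in zip(c_dev, y_dev))
        # A constant column carries no signal, so it gets the lowest score.
        scores[j] = abs(cov / denom) if denom else 0.0
    return scores


def rfe(X, y, k, score_fn=abs_pearson_scores):
    """Drop the lowest-scoring feature until k column indices remain."""
    n_features = len(X[0])
    if not 1 <= k <= n_features:
        raise ValueError("k must be between 1 and the number of features")
    selected = list(range(n_features))            # step 1
    while len(selected) > k:                      # step 2
        scores = score_fn(X, y, selected)         # steps 2a-2b
        worst = min(selected, key=scores.__getitem__)  # step 2c
        selected.remove(worst)                    # step 2d
    return selected                               # step 3


# Toy data: column 0 tracks y, column 1 is constant, column 2 is noise.
X = [
    [1.0, 5.0, 0.3],
    [2.0, 5.0, 0.1],
    [3.0, 5.0, 0.4],
    [4.0, 5.0, 0.2],
]
y = [2.1, 3.9, 6.2, 7.8]
selected = rfe(X, y, k=1)
```

On the toy data the constant column is removed first and the noise column second, so only column 0 survives. Passing a different `score_fn` is how a trained model's importances would be plugged in.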
3.3. Machine Learning-Based Predictive Modeling Using XGBoost
3.4. Hyperparameter Optimization Using the SOA
3.5. Structure-Property Relationship Analysis Using SHAP and LIME
4. Experimental Setup
4.1. Dataset Description
4.2. Data Preprocessing
4.2.1. Data Cleaning
4.2.2. Normalization Using Min-Max Scaling
| Algorithm 3: Data Preprocessing for Polymer Property Prediction |
|---|
| Input: raw dataset $D$ with features $X$ and target $y$ |
| Output: cleaned and normalized dataset $D'$ |
| 1. Load dataset $D$ |
| 2. Handle missing values: |
| a. For each feature $x$ in $X$: |
| i. If missing values exist, replace them with the mean or median, or remove the row |
| 3. Normalize features using Min-Max scaling: |
| a. For each feature $x$ in $X$: |
| i. $x' = \dfrac{x - x_{\min}}{x_{\max} - x_{\min}}$ |
| 4. Split the dataset into: |
| a. Training set (70%) |
| b. Validation set (15%) |
| c. Test set (15%) |
| 5. Return $D'$ |
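The three operations in Algorithm 3 (imputation, Min-Max scaling, and the 70/15/15 split) can be sketched on a single feature column as follows. This is a minimal standard-library illustration under stated assumptions: the helper names are hypothetical, missing values are represented as `None`, mean imputation is chosen from the options the algorithm lists, and the split is a seeded random shuffle.

```python
import random
from statistics import mean


def impute_mean(column):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in column]


def min_max_scale(column):
    """Map a column to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0] * len(column)
    return [(v - lo) / (hi - lo) for v in column]


def split_indices(n_rows, train_frac=0.70, val_frac=0.15, seed=0):
    """Shuffle row indices and cut them into train / validation / test."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_train = round(n_rows * train_frac)
    n_val = round(n_rows * val_frac)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]   # remainder, about 15%
    return train, val, test


raw_feature = [1.0, None, 3.0, 5.0]
scaled_feature = min_max_scale(impute_mean(raw_feature))
train_idx, val_idx, test_idx = split_indices(20)
```

The sketch follows the step order of Algorithm 3, which scales before splitting. A common alternative is to compute the minimum and maximum on the training rows only and reuse them for the validation and test rows, so that no test-set statistics enter the scaling.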
4.3. Performance Metrics
Coefficient of Determination (R²)
Mean Absolute Error (MAE)
Root Mean Square Error (RMSE)
Mean Squared Error (MSE)
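The four metrics follow their standard definitions, which can be written out as a short sketch. The function names and the toy values below are illustrative only and are not results from this study.

```python
from math import sqrt


def mse(y_true, y_pred):
    """Mean squared error: average of squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean square error: square root of the MSE."""
    return sqrt(mse(y_true, y_pred))


def mae(y_true, y_pred):
    """Mean absolute error: average of absolute residuals."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot.

    Undefined when y_true is constant (SS_tot = 0); not handled here.
    """
    y_mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy example (not data from the paper).
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 2.5, 4.0]
metrics = {
    "MSE": mse(y_true, y_pred),
    "RMSE": rmse(y_true, y_pred),
    "MAE": mae(y_true, y_pred),
    "R2": r2(y_true, y_pred),
}
```

Because RMSE is the square root of MSE, the two always rank models identically; reporting both mainly helps when comparing against studies that publish only one of them.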
5. Results and Discussion
| Category | Component | Specification / Version |
|---|---|---|
| Software | scikit-learn | 1.4.0 |
| Software | SOA | custom |
| Software | RDKit | 2023.9 |
| Software | SHAP | 0.44.0 |
| Software | LIME | 0.2.0.1 |
| Software | pandas | 2.1.4 |
| Software | NumPy | 1.26.3 |
| Software | PyArrow | 14.0.1 |
| Software | SciPy | 1.11.4 |
| Hardware | CPU Model | Intel Core i7-12700H |
| Hardware | CPU Max Boost | 4.70 GHz |
| Hardware | RAM Capacity | 16 GB DDR5 |
| Hardware | GPU Model | NVIDIA GeForce RTX 3060 (Laptop) |
| Hardware | Storage Type | NVMe SSD |
| Hardware | Storage Capacity | 512 GB |
| Hardware | Storage Read Speed | 3500 MB/s |
| Parameter | Description | Value |
|---|---|---|
| n_estimators | Number of boosting trees | 2450 |
| max_depth | Maximum tree depth | 8 |
| learning_rate | Learning rate η | 0.0312 |
| subsample | Row subsampling ratio | 0.82 |
| colsample_bytree | Column fraction per tree | 0.75 |
| colsample_bylevel | Column fraction per level | 0.68 |
| colsample_bynode | Column fraction per node | 0.72 |
| min_child_weight | Minimum sum of Hessian in a leaf | 3 |
| max_delta_step | Maximum delta step per tree | 2 |
| base_score | Initial prediction score | 0.5 |
| n_jobs | Parallel CPU threads | -1 (all available) |
| n_features | Features selected via RFE | 150 |
| n_trials | Optimization trials | 50 |
| cv_folds | Cross-validation folds | 5 |
| scaler | Feature and target scaler | MinMaxScaler, range [0, 1] |
| rfecv_step | RFE step size | 0.05 |
| rfecv_cv | RFE cross-validation folds | 3 |
| rfecv_scoring | RFE scoring metric | R2 |
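For reproducibility, the tabulated settings can be collected into a configuration fragment such as the one below. This is a sketch under assumptions: the dictionary names are hypothetical, the model keys are written in lowercase on the assumption that they map to keyword arguments of `xgboost.XGBRegressor`, and the remaining pipeline-level entries are grouped separately because they are not XGBoost parameters.

```python
# Model hyperparameters, copied from the table. Key names are assumed to
# match keyword arguments of xgboost.XGBRegressor (not imported here).
XGB_PARAMS = {
    "n_estimators": 2450,
    "max_depth": 8,
    "learning_rate": 0.0312,
    "subsample": 0.82,
    "colsample_bytree": 0.75,
    "colsample_bylevel": 0.68,
    "colsample_bynode": 0.72,
    "min_child_weight": 3,
    "max_delta_step": 2,
    "base_score": 0.5,
    "n_jobs": -1,  # use all available CPU threads
}

# Pipeline-level settings from the same table (hypothetical grouping).
PIPELINE_SETTINGS = {
    "n_features": 150,       # features kept after RFE
    "n_trials": 50,          # optimization trials
    "cv_folds": 5,           # cross-validation folds for model evaluation
    "scaler_range": (0, 1),  # Min-Max scaling range
    "rfecv_step": 0.05,      # RFE step size
    "rfecv_cv": 3,           # cross-validation folds inside RFE
    "rfecv_scoring": "r2",   # RFE scoring metric
}
```

With XGBoost installed, the first dictionary could be unpacked as `xgboost.XGBRegressor(**XGB_PARAMS)`. In scikit-learn's `RFECV`, a `step` value between 0 and 1 is interpreted as the fraction of features removed per iteration, which is presumably the meaning of the 0.05 step size here.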
5.1. Correlation Between Actual and Predicted Values: Performance Metrics
5.2. Model Convergence Analysis Using RMSE Learning Curve
5.3. Model Convergence Analysis Using MAE Learning Curve
5.4. Overall Model Performance Evaluation
5.5. Residual Plot of Predicted and Actual Glass
5.6. Feature Importance Analysis Using SHAP Beeswarm Plot
5.7. Local Feature Contribution Analysis Using LIME
5.8. Residual Distribution Analysis
5.9. Comparison with Existing Methods
5.10. Discussion
6. Conclusion and Future Works
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Basova, T. V., Belykh, D. V., Vashurin, A. S., Klyamer, D. D., Koifman, O. I., Krasnov, P. O., Lomova, T. N., Loukhina, I. V., Motorina, E. V., Pakhomov, G. L., Polyakov, M. S., Semeikin, A. S., Stuzhin, P. A., Sukhikh, A. S., & Travkin, V. V. (2023). Tetrapyrrole Macroheterocyclic Compounds. Structure–Property Relationships. Journal of Structural Chemistry, 64(5), 766–852.
- Bicerano, J. (1993). Prediction of polymer properties. Marcel Dekker.
- Chong, S. S., Ng, Y. S., Wang, H.-Q., & Zheng, J.-C. (2024). Advances of machine learning in materials science: Ideas and techniques. Frontiers of Physics, 19(1), 13501.
- Ding, F., Liu, L.-Y., Liu, T.-L., Li, Y.-Q., Li, J.-P., & Sun, Z.-Y. (2023). Predicting the Mechanical Properties of Polyurethane Elastomers Using Machine Learning. Chinese Journal of Polymer Science, 41(3), 422–431.
- Fattouche, M., Belaidi, S., Abchir, O., Al-Shaar, W., Younes, K., Al-Mogren, M. M., Chtita, S., Soualmia, F., & Hochlaf, M. (2024). ANN-QSAR, Molecular Docking, ADMET Predictions, and Molecular Dynamics Studies of Isothiazole Derivatives to Design New and Selective Inhibitors of HCV Polymerase NS5B. Pharmaceuticals, 17(12), 1712.
- Günay, M. E., Tapan, N. A., & Akkoç, G. (2022). Analysis and modeling of high-performance polymer electrolyte membrane electrolyzers by machine learning. International Journal of Hydrogen Energy, 47(4), 2134–2151.
- Hatakeyama-Sato, K. (2023). Recent advances and challenges in experiment-oriented polymer informatics. Polymer Journal, 55(2), 117–131.
- Hu, Y., Li, Y., Zhang, Y., Ding, S., Wang, R., & Xia, R. (2024). Design methodology for functional gradient star-shaped honeycomb with enhanced impact resistance and energy absorption. Materials Today Communications, 38, 108020.
- Kazemi-Khasragh, E., Fernández Blázquez, J. P., Garoz Gómez, D., González, C., & Haranczyk, M. (2024). Facilitating polymer property prediction with machine learning and group interaction modelling methods. International Journal of Solids and Structures, 286–287(1), 112547.
- Kibrete, F., Trzepieciński, T., Gebremedhen, H. S., & Woldemichael, D. E. (2023). Artificial Intelligence in Predicting Mechanical Properties of Composite Materials. Journal of Composites Science, 7(9), 364.
- Kosicka, E., Krzyzak, A., Dorobek, M., & Borowiec, M. (2022). Prediction of Selected Mechanical Properties of Polymer Composites with Alumina Modifiers. Materials, 15(3), 882.
- Krevelen, D. W. van. (2009). Properties of polymers: Their correlation with chemical structure: their numerical estimation and prediction from additive group contributions (4th, completely rev. ed. / K. te Nijenhuis. eds.). Elsevier.
- Kuenneth, C., & Ramprasad, R. (2023). polyBERT: A chemical language model to enable fully machine-driven ultrafast polymer informatics. Nature Communications, 14(1), 4099.
- Li, X., Chua, J. W., Yu, X., Li, Z., Zhao, M., Wang, Z., & Zhai, W. (2024). 3D-Printed Lattice Structures for Sound Absorption: Current Progress, Mechanisms and Models, Structural-Property Relationships, and Future Outlook. Advanced Science, 11(4), 2305232.
- Li, Z., Jiang, M., Wang, S., & Zhang, S. (2022). Deep learning methods for molecular representation and property prediction. Drug Discovery Today, 27(12), 103373.
- Lv, G., & Zhu, J. (2026). Intrinsic Thermal Conductivity of Molecular Engineered Polymer. Advanced Functional Materials, 36(6), 2420708.
- Milad, A., Hussein, S. H., Khekan, A. R., Rashid, M., Al-Msari, H., & Tran, T. H. (2022). Development of ensemble machine learning approaches for designing fiber-reinforced polymer composite strain prediction model. Engineering with Computers, 38(4), 3625–3637.
- Nistane, J., Chen, L., Lee, Y., Lively, R., & Ramprasad, R. (2022). Estimation of the Flory-Huggins interaction parameter of polymer-solvent mixtures using machine learning. MRS Communications, 12(6), 1096–1102.
- Parambil, V., Tripathi, U., Goyal, H., & Batra, R. (2025). Polymer Property Prediction Using Machine Learning. In K. Roy & A. Banerjee (Eds.), Materials Informatics III: Polymers, Solvents and Energetic Materials (pp. 119–147). Springer Nature Switzerland.
- Phua, Y. K., Fujigaya, T., & Kato, K. (2023). Predicting the anion conductivities and alkaline stabilities of anion conducting membrane polymeric materials: Development of explainable machine learning models. Science and Technology of Advanced Materials, 24(1), 2261833.
- polyOne Data Set—100 million hypothetical polymers including 29 properties. (2022). https://zenodo.org/records/7766806.
- Sampedro, G. A. R., Rachmawati, S. M., Kim, D.-S., & Lee, J.-M. (2022). Exploring Machine Learning-Based Fault Monitoring for Polymer-Based Additive Manufacturing: Challenges and Opportunities. Sensors, 22(23), 9446.
- Singh, G., & Chandra, S. (2023). Unravelling the structural-property relations of porphyrinoids with respect to photo- and electro-chemical activities. Electrochemical Science Advances, 3(1), e2100149.
- Sobuz, Md. H. R., Khatun, M., Kabbo, Md. K. I., & Sutan, N. M. (2025). An explainable machine learning model for encompassing the mechanical strength of polymer-modified concrete. Asian Journal of Civil Engineering, 26(2), 931–954.
- Sofos, F., Papakonstantinou, C. G., Valasaki, M., & Karakasidis, T. E. (2023). Fiber-Reinforced Polymer Confined Concrete: Data-Driven Predictions of Compressive Strength Utilizing Machine Learning Techniques. Applied Sciences, 13(1), 567.
- Starkova, O., Gagani, A. I., Karl, C. W., Rocha, I. B. C. M., Burlakovs, J., & Krauklis, A. E. (2022). Modelling of Environmental Ageing of Polymers and Polymer Composites—Durability Prediction Methods. Polymers, 14(5), 907. [CrossRef]
- Tamasi, M. J., Patel, R. A., Borca, C. H., Kosuri, S., Mugnier, H., Upadhya, R., Murthy, N. S., Webb, M. A., & Gormley, A. J. (2022). Machine Learning on a Robotic Platform for the Design of Polymer–Protein Hybrids. Advanced Materials, 34(30), 2201809.
- Thomas, A. J., Barocio, E., & Pipes, R. B. (2022). A machine learning approach to determine the elastic properties of printed fiber-reinforced polymers. Composites Science and Technology, 220(7), 109293.
- Timmanaikar, S. T., Hayat, S., Hosamani, S. M., & Banu, S. (2024). Structure–property modeling of coumarins and coumarin-related compounds in pharmacotherapy of cancer by employing graphical topological indices. The European Physical Journal E, 47(5), 31.
- Xu, C., Lei, C., Wang, Y., & Yu, C. (2022). Dendritic Mesoporous Nanoparticles: Structure, Synthesis and Properties. Angewandte Chemie, 134(12), e202112752.
- Xue, X., Shen, G., & Liao, J. (2024). Thermodynamic property of sandwich cylindrical shell structure with metallic wire mesh: Numerical modeling and experimental analysis. Chinese Journal of Aeronautics, 37(1), 138–152.
- Yakoubi, S. (2025). Sustainable Revolution: AI-Driven Enhancements for Composite Polymer Processing and Optimization in Intelligent Food Packaging. Food and Bioprocess Technology, 18(1), 82–107.
- Yan, C., & Li, G. (2023). The Rise of Machine Learning in Polymer Discovery. Advanced Intelligent Systems, 5(4), 2200243.
- Yue, D., Feng, Y., Liu, X., Yin, J., Zhang, W., Guo, H., Su, B., & Lei, Q. (2022). Prediction of Energy Storage Performance in Polymer Composites Using High-Throughput Stochastic Breakdown Simulation and Machine Learning. Advanced Science, 9(17), 2105773.
- Zhang, B., Luo, C., Jiang, H., Feng, S., Li, X., Zhang, B., & Ye, Y. (2023). Adaptive Transfer of Graph Neural Networks for Few-Shot Molecular Property Prediction. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 20(6), 3863–3875.
- Zhang, Q., Huang, J., Wang, K., & Huang, W. (2022). Recent Structural Engineering of Polymer Semiconductors Incorporating Hydrogen Bonds. Advanced Materials, 34(26), 2110639.
- Zhang, Z., Liu, Q., & Wu, D. (2022). Predicting stress–strain curves using transfer learning: Knowledge transfer across polymer composites. Materials & Design, 218(5), 110700.
- Zhao, Y., Mulder, R. J., Houshyar, S., & Le, T. C. (2023). A review on the application of molecular descriptors and machine learning in polymer design. Polymer Chemistry, 14(29), 3325–3346.












| Property | Full Dataset Mean | Subset Mean | Difference (%) |
|---|---|---|---|
| Molecular Weight | 412.6 | 409.8 | 0.68 |
| Glass Transition Temperature | 381.5 | 378.9 | 0.68 |
| Melting Temperature | 524.2 | 519.7 | 0.86 |
| LogP | 3.42 | 3.39 | 0.88 |
| Rotatable Bonds | 6.84 | 6.79 | 0.73 |
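The Difference (%) column is consistent with a relative difference taken against the full-dataset mean, $100 \cdot |\mu_{\text{full}} - \mu_{\text{subset}}| / |\mu_{\text{full}}|$. That the column was computed this way is an assumption inferred from the tabulated values; the short sketch below (with a hypothetical helper name) recomputes the column under that assumption.

```python
def relative_difference_pct(full_mean, subset_mean):
    """Relative difference in percent, taken against the full-dataset mean."""
    return abs(full_mean - subset_mean) / abs(full_mean) * 100.0


# (full dataset mean, subset mean) pairs copied from the table.
table_means = {
    "Molecular Weight": (412.6, 409.8),
    "Glass Transition Temperature": (381.5, 378.9),
    "Melting Temperature": (524.2, 519.7),
    "LogP": (3.42, 3.39),
    "Rotatable Bonds": (6.84, 6.79),
}

computed = {
    name: round(relative_difference_pct(full, subset), 2)
    for name, (full, subset) in table_means.items()
}
```

Under this definition every recomputed value matches the table to two decimal places, and all five properties differ by less than 1% between the subset and the full dataset.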
| Property | R2 | MAE | RMSE |
|---|---|---|---|
| | 0.9886 | 0.0420 | 0.0504 |
| | 0.9721 | 0.0463 | 0.0571 |
| | 0.9654 | 0.0512 | 0.0618 |
| Random Seed | RMSE | MAE | R² Score |
|---|---|---|---|
| 10 | 0.124 | 0.091 | 0.962 |
| 20 | 0.127 | 0.093 | 0.960 |
| 30 | 0.122 | 0.089 | 0.964 |
| 40 | 0.126 | 0.092 | 0.961 |
| 50 | 0.123 | 0.090 | 0.963 |
| Average | 0.124 | 0.091 | 0.962 |
| Model | MAE | RMSE | MSE | R2 |
|---|---|---|---|---|
| ML + Group Interaction Modeling | 0.041 | 0.0498 | 0.0024 | 0.981 |
| Super learning for composite materials | 0.043 | 0.051 | 0.0026 | 0.979 |
| ML for Flory Huggins Parameter Prediction | 0.044 | 0.050 | 0.0028 | 0.985 |
| XGBoost + SOA (proposed) | 0.0420 | 0.0504 | 0.0025 | 0.988 |
| Process | Execution Time (s) |
|---|---|
| Data Preprocessing | 12.4 |
| Feature Selection (RFE) | 38.7 |
| SOA Optimization | 65.2 |
| Model Training | 24.5 |
| Total Execution Time | 140.8 |
| Model | R² | MAE | RMSE |
|---|---|---|---|
| polyBERT | 0.972 | 0.0584 | 0.0692 |
| Proposed XGBoost–SOA | 0.9886 | 0.042 | 0.0504 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).