Machine Learning Applications in Biofuels’ Life Cycle: Soil, Feedstock, Production, Consumption, and Emissions

Abstract: Machine Learning (ML) is one of the major driving forces behind the fourth industrial revolution. This study reviews the ML applications in the life cycle stages of biofuels, i.e., soil, feedstock, production, consumption, and emissions. ML applications in the soil stage were mostly applied to satellite images of land to estimate the yield of biofuels or to analyze the suitability of agricultural land. The existing literature has reported on the assessment of rheological properties of the feedstocks and their effect on the quality of biofuels. The ML applications in the production stage include estimation and optimization of quality, quantity, and process conditions. The fuel consumption and emissions stage includes analysis of engine performance and estimation of emissions temperature and composition. This study identifies the following trends: the most dominant ML method, the life cycle stage with the most ML usage, the type of data used for the development of the ML-based models, and the frequently used input and output variables for each stage. The findings of this article would be beneficial for academics and industry professionals involved in model development in different stages of the biofuel life cycle.


Introduction
Machine Learning (ML) is one of the major forces driving the fourth industrial revolution, commonly known as Industry 4.0. ML enables a computer system to solve complex research questions through implicit and automated "learning" that self-improves without being explicitly preprogrammed to do so. From an algorithmic point of view, the term machine refers to an automated process that incrementally updates its problem-solving capability through successive iterations based on inputs from external variants. The concept of ML was introduced by Samuel [1], one of the pioneers of modern Artificial Intelligence (AI).
Machine Learning methods are broadly classified into supervised learning, unsupervised learning, and reinforcement learning. Supervised learning is performed on labeled outputs for a given set of inputs [2]. The trained algorithm is then used to predict

Methodology
For literature retrieval, the Web of Science and Google Scholar databases were mainly used. To document the literature in a logical and succinct way suitable for a review paper, only research articles were reviewed. Most of the literature was collected from the Web of Science database; however, Google Scholar was also used to collect some gray literature on ML applications in the soil and feedstock stages. The keywords listed in Table 1 were used to search for relevant literature in the Web of Science database.

Table 1. Keywords used in literature survey.

Categories: Keywords Used for Search in the Databases
Machine Learning: artificial neural networks | boosting | data-based | data-driven | decision tree | deep learning | dimensionality reduction algorithms | discriminant analysis | ensemble learning | estimation | extreme learning machine | genetic algorithm | inference | kNN | K-Means | least-squares | logistic regression | linear regression | machine learning | moving average | multi layered perception | Naive Bayes | neuro fuzzy | partial least squares | principal component analysis | prediction | random forest(s) | soft sensor | support vector machine | virtual sensor
Biofuels: bioalcohol | biodiesel | bioethers | biofuels | biogas | biohydrogen | dimethylfuran | green diesel
Soil: drone | land | land image | satellite | soil | surveillance
Feedstock: algae and aquatic biomass | biomass | biosolid(s) | corn | energy cane | feed | feedstock | forest thinning | high biomass sorghum | hybrid poplars | industrial waste gases | logging residues | lignocellulosic crops | lignocellulosic residues | manure slurries | micro algae | miscanthus | municipal waste | oil-based residues | oil crops | organic residues | plant | plastics | raw material(s) | shrub willows | sludge | starch crops | sugar crops | sweet sorghum | switch grass | vegetable oil | waste | waste food | waste gases
Production: catalytic synthesis | distillation | drying | fermentation | gas cleaning | gasification | operation | process | product | production | reactor | refining | unit | water gas shift | yield
Consumption & emissions: air pollution | carbon emission | emission | energy potential | engine | environment | exhaust gases | fuel consumption | fuel quality | fuel use | green house gases | greenhouse | mileage

The following rules were applied while deciding the relevance of the articles: (i) the title of the article should contain at least one word from each of the ML and biofuels categories of keywords as shown in Table 1, and (ii) the title should also contain at least one word from any of the other four categories, i.e., soil, feedstock, production, and consumption and emissions. The collected literature was divided into four stages of the life cycle on the basis of the keywords found in the second step of the above-mentioned procedure.
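The two title-screening rules can be illustrated with a short Python sketch. It is not the authors' actual tooling: the keyword sets are hypothetical, heavily abbreviated stand-ins for Table 1, the function names are invented for illustration, and simple case-insensitive substring matching is assumed.

```python
# Hypothetical, abbreviated keyword sets; Table 1 holds the full vocabulary.
ML_KEYWORDS = {"machine learning", "neural network", "random forest",
               "support vector machine", "prediction"}
BIOFUEL_KEYWORDS = {"biofuel", "biodiesel", "biogas", "biohydrogen", "bioalcohol"}
STAGE_KEYWORDS = {
    "soil": {"soil", "land", "satellite"},
    "feedstock": {"feedstock", "biomass", "sludge"},
    "production": {"production", "yield", "fermentation"},
    "consumption and emissions": {"emission", "engine", "fuel consumption"},
}


def contains_any(title, keywords):
    """True if any keyword occurs in the title (case-insensitive substring match)."""
    lowered = title.lower()
    return any(keyword in lowered for keyword in keywords)


def screen_title(title):
    """Return the life cycle stages a title maps to, or [] if it is not relevant.

    Rule (i): the title needs at least one ML keyword and one biofuel keyword.
    Rule (ii): the title needs at least one keyword from a stage category.
    """
    if not (contains_any(title, ML_KEYWORDS) and contains_any(title, BIOFUEL_KEYWORDS)):
        return []
    return [stage for stage, keywords in STAGE_KEYWORDS.items()
            if contains_any(title, keywords)]


if __name__ == "__main__":
    print(screen_title("Prediction of biodiesel yield using a neural network"))
```

A title that matches several stage categories is returned under all of them, which is one way such a screen could lead to papers later being moved across stages during manual review.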
In the first round, the numbers of papers related to the soil, feedstock, production, and consumption and emission stages were 10, 91, 90, and 75, respectively. The collected literature was initially studied to screen out the irrelevant papers. Some papers were moved across stages. After the screening, the production stage contained the largest number of papers, with 64 papers on ML applications, followed by consumption and emission with 43 papers, feedstock with 12 papers, and soil with 5 papers.

Applications of ML Methods in the Life Cycle of Biofuels
In this section, the stage-wise application of Machine Learning (ML) methods is discussed. This section is divided into the following subsections: Section 3.1 Soil, Section 3.2 Feedstock, Section 3.3 Production and Section 3.4 Consumption, Engine Performance and Emission Stages. A summary of applications is presented in Section 3.5.

Soil
Several studies on ML applications in the soil stage of the life cycle of biofuels have been reported at both the tree and plot levels. The ML applications in the soil phase are summarized in Table 2 and discussed below.

Table 2. Summary of ML applications in the soil phase.

ML Method; Input Variables; Output Variables; Error Range; References
Linear Mixed-Effects (LME) regression, Random Forest (RF), Support Vector Regression (SVR); Tree crowns; Biomass estimation; ; [19]

For example, Gleason and Im [19] compared Linear Mixed-Effects (LME) regression, Support Vector Regression (SVR), Random Forest (RF), and Cubist for the prediction of biomass in a forest with 40% to 60% canopy closure. SVR outperformed the other methods. Huntington et al. [20] used RF to predict future trends in Sorghum bicolor yield under two irrigation regimes and four Greenhouse Gas (GHG) emission scenarios. The RF model trained on uniquely identified data, identified by year and country, achieved reasonable prediction accuracy. In another study by Habyarimana [21], various ML methods were used for the prediction of sorghum biomass yield based on satellite images of sorghum fields. The ML methods included PLS Discriminant Analysis (PLS-DA), PCA Discriminant Analysis (PCA-DA), ANN, RF, SVM with Nonlinear Kernel (SVM-G), SVM with Radial basis Kernel (SVM-R), SVM with Polynomial basis Kernel (SVM-P), SVM with Linear Classifier (SVM), eXtreme Gradient Boosting-xgbtree method (GBT), eXtreme Gradient Boosting-xgbLinear method (GBL), eXtreme Gradient Boosting-xgb DART method (GBD), and a simple linear model. The GBT method outperformed the other methods. Lee et al. [22] used the Boosted Regression Tree (BRT) model to assess the environmental impacts of corn production for the years 2022 to 2100. The study was conducted in the context of four emissions scenarios, where the BRT model achieved correlation coefficients of 0.82 and 0.78 in estimating eutrophication impacts and global warming, respectively. Yang et al. [23] used a two-step ML approach. Initially, the Gaussian Process Model (GPM) was used for crop yield down-scaling, followed by yield estimation through the RF model. The GPM, a Bayesian inference method, helped in realizing accurate estimations.

Feedstock
ML applications in the feedstock stage of biofuels have received attention in the literature, mostly in recent years. The ML applications in the feedstock phase are summarized in Table 3 and discussed below.
Mahanty et al. [24] used ANN and statistical regression models to predict specific methane yield in the production of biogas from industrial sludges. The ANN model performed better than the statistical regression model. It was revealed that sludges from the chemical industry have a relatively higher impact on methane in the produced biogas. Mairizal et al. [25] used multiple linear regressions to predict viscosity, Flash Point (FP), density, higher heating value, and oxidative stability of biodiesel produced from sunflower oil, peanut oil, hydrogenated coconut oil, hydrogenated copra oil, beef tallow, rapeseed oil, and walnut oil. It was inferred from the results that the addition of PU/MU as an independent parameter increases prediction performance. Giwa et al. [26] used ANN to estimate Cetane Number (CN), Kinematic Viscosity (KV), FP, and density of biodiesel produced from fatty acid. The accuracy of estimation and the average absolute deviation of the model were as follows: CN (96.6%; 1.637%), FP (99.07%; 0.997%), KV (95.80%; 1.638%), and density (99.40%; 0.101%). Tchameni et al. [27] used Multiple Non-Linear Regression (MNLR) and ANN for the prediction of rheological properties of waste vegetable oil for the production of biodiesel. The ANN model had superior performance over the MNLR method. Dahunsi [28] used single and multiple linear regressions to estimate methane yield from biomass structural components. A fairly high correlation was found between the chemical composition and methane potentials of the biomass.
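The average absolute deviation figures quoted above can be made concrete with a small Python sketch. The formula used here (mean of |predicted - measured| / |measured|, expressed in percent) is a common definition and an assumption on our part; the cited studies may define the metric differently, and the cetane numbers in the example are invented for illustration.

```python
def average_absolute_deviation_percent(measured, predicted):
    """Mean relative absolute deviation between predictions and measurements, in percent."""
    if len(measured) != len(predicted) or not measured:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(p - m) / abs(m) for m, p in zip(measured, predicted))
    return 100.0 * total / len(measured)


if __name__ == "__main__":
    # Hypothetical cetane numbers, not data from the cited studies.
    measured_cn = [50.0, 55.0, 60.0]
    predicted_cn = [51.0, 54.45, 60.6]
    print(average_absolute_deviation_percent(measured_cn, predicted_cn))
```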
Reimann et al. [29] used ML methods such as Naive Bayes, RF, and ANN for the classification of micro-algae cells, with RF outperforming the other ML techniques. It was inferred that pairing the RF-based modeling framework with microscopic features of samples may result in high-resolution distinction and quantification of different species in less time than the conventional lab-based approach. Tang et al. [30] applied MLR and RF for the prediction of the yield and hydrogen content of bio-oil using information on biomass compositions and pyrolysis conditions. The results verified that RF performs better than MLR. Shahbeig et al. [31] developed an SVR-based model to predict the thermal characteristics of biomasses. The predicted results were aligned with experimental findings, with a correlation coefficient of 0.9999. Ighalo et al. [32] developed LRA and Stochastic Gradient Descent (SGD) based models to predict the Higher Heating Value (HHV) of biomass. The LRA model was observed to be more accurate. Kumar et al. [33] developed an Artificial Neural Network (ANN) coupled with a Genetic Algorithm (GA) to increase the lipid yield. The input parameters were glycerol, NH4Cl, MgSO4, and KH2PO4, which were screened by Plackett-Burman design. The regression correlation coefficient obtained was 0.9918. A maximum biomass concentration of 15.16 ± 0.69 g/L was achieved with 0.49 ± 0.02 g lipid per gram of yeast. Examination of the lipid composition revealed that the main fatty acids, in order of their relative richness (% w/w), were oleic, tridecanoic, palmitic, stearic, palmitoleic, and linolelaidic acids. Cheng et al. [34] developed an RF model for biochar production through a slow pyrolysis process. The model outputs were the yields and quality of biochar produced by slow pyrolysis of different feedstocks. The feedstock compositions, reaction temperature, residence time, and heating rate were used as the model inputs. The model outputs were used with Life Cycle Assessment (LCA) and economic analysis to find net Global Warming Potential (GWP), Energy Return on Investment (EROI), and Minimum Product Selling Price (MPSP). Cheng et al. [35] developed MLR, regression tree (RT), and RF models to predict yields and characteristics of products (biocrude, hydrochar, gas, and aqueous co-product) from Hydrothermal Treatment (HTT) of various feedstocks. Feedstock characteristics together with reaction temperature, reaction time, and initial concentration were used as the model inputs. The model outputs were used with LCA and economic analysis to find net GWP and EROI.

Biodiesel
The studies reported in the biodiesel production phase are further classified based on the output of the ML models, i.e., quality, yield, and process efficiency.

Quality Estimation
Soltani et al. [36] used the ANN model to determine optimum conditions to get the desired nanocrystalline size of mesoporous SO3H-ZnO catalyst. The optimized conditions were a calcination temperature of 700 °C, a reaction temperature of 160 °C, a reaction time of 18 min, and a Zn concentration of 4 mmol. Ahmad et al. [37] used Least Squares Boosting (LSBoost) integrated with the polynomial chaos expansion method in the production of vegetable oil-based biodiesel under uncertainty. The average Mean Absolute Deviation Percent (MADP) in the predicted values of the target output was 0.84 in response to 1% uncertainty in each input variable of the models. Gulum et al. [38] used regression and ANN models to predict the viscosity and density of ternary blends consisting of biodiesel, diesel, and vegetable oil. Exponential and rational models previously reported in the literature were compared with regression models and the ANN approach. Tomazzoni [39] used PCA to estimate viscosity, relative density, and percentage conversion of vegetable oil to methyl esters in the production of diesel from vegetable oil. Through the use of PCA, they were able to differentiate pure samples of waste oil, diesel, and biodiesel from their respective blends. Sarve et al. [40] used ANN and Central Composite Design (CCD) to predict Fatty Acid Methyl Ester (FAME) content in the production of biodiesel from sesame oil. The study revealed that catalyst concentration has the highest impact on the FAME content of the final product. The ANN model showed better performance than CCD.

Yield Estimation
Several studies on biodiesel yield prediction through ML methods have been reported in the literature, where the biodiesel was produced from jatropha-algae, castor oil, and anaerobic sludge, e.g., [41][42][43]. Kumar et al. [41] developed an ANN model to predict biodiesel yield using various jatropha-algae oil blends as inputs. Banerjee et al. [42] used an ANN model for predicting the fractional formation of FAME. They also devised a kinetic model using the experimental and computed data. The experimental and the ANN-based predicted data were used to estimate the rate constants of the kinetic model. The ANN model was able to predict the % FAME yield within 8% deviation. Kanat and Saral [43] used ANN to estimate the production rate of biodiesel from anaerobic sludge in a thermophilic up-flow anaerobic sludge blanket reactor. Longer moving-average periods yielded a higher correlation coefficient, reaching 0.927.
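To make the moving-average and correlation-coefficient terminology concrete, the following Python sketch shows a trailing moving average and the Pearson correlation coefficient. It is a generic illustration, not the preprocessing pipeline of Kanat and Saral [43]; the function names and sample values are hypothetical.

```python
from math import sqrt


def moving_average(series, window):
    """Trailing moving average; returns len(series) - window + 1 values."""
    if window < 1 or window > len(series):
        raise ValueError("window must be between 1 and len(series)")
    return [sum(series[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(series))]


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


if __name__ == "__main__":
    # Hypothetical daily production rates, smoothed over a 3-day window.
    print(moving_average([1, 2, 3, 4, 5], 3))
    print(pearson_r([1, 2, 3], [2, 4, 6]))
```

A longer window smooths out more short-term fluctuation in the input signal, which is one plausible reason a model trained on it could correlate better with the measured output.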
In several studies, ANN was used together with Response Surface Methodology (RSM) for the prediction of the yield of biodiesel from jatropha-algae oil by Kumar et al. [44], goat tallow by Chakraborty and Sahu [45], and Enterobacter species by Pandu et al. [46]. In the study by Kumar et al. [44], ANN outperformed RSM. Chakraborty and Sahu [45] used RSM and ANN to identify optimal parametric values that result in maximum FA conversion while maintaining a FAME yield that met the American Society for Testing and Materials (ASTM) biodiesel specifications. ANN and RSM had comparable predictive performance. Pandu et al. [46] compared the performance of the RSM and ANN models. The ANN model outperformed the RSM model.
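For context, RSM is commonly built on a second-order polynomial fitted to designed experiments. The cited studies' exact model forms are not reproduced in this review, so the generic form below is an assumption: $y$ denotes the predicted response (e.g., biodiesel yield), $x_i$ the $k$ coded process variables, $\beta$ the fitted coefficients, and $\varepsilon$ the residual error.

```latex
y = \beta_0 + \sum_{i=1}^{k} \beta_i x_i + \sum_{i=1}^{k} \beta_{ii} x_i^2 + \sum_{i<j} \beta_{ij} x_i x_j + \varepsilon
```

An ANN replaces this fixed quadratic form with a more flexible nonlinear mapping, which is consistent with the reported cases where ANN matches or outperforms RSM.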
Kumar et al. [47] used ANN and Linear Regression (LR) to predict soybean-based biodiesel yield, where the ANN outperformed the LR. Moradi et al. [48] used ANN and kinetic models to estimate the yield of biodiesel from soybean oil. The ANN exhibited the capability of learning from experimental data and was simpler to apply than the classic kinetic modeling method. Guo and Baghban [49] and Mostafaei et al. [50] used the Adaptive Neuro-Fuzzy Inference System (ANFIS) to estimate biodiesel yield from algae oil blend, vegetable oil, and waste cooking oil. They achieved a low absolute deviation between the experimental data and the ANFIS-predicted data, which confirmed the suitability of ANFIS for predicting the biodiesel yield from vegetable oil. Maran and Priya [51] used ANN and RSM to predict biodiesel yield from muskmelon oil. The ANN model, again, outperformed the RSM model.

Quality and Yield Optimization
Several studies have focused on the optimization of the yield and quality of biodiesel. Bobadilla et al. [52] used a GA-based SVM to estimate and optimize the yield of waste cooking oil biodiesel with specific properties, such as a higher heating value with decreased viscosity, density, and turbidity. To produce biodiesel of high quality, the optimum inputs were a catalyst (NaOH) dosage from 1.00 to 1.38 wt%, molar ratio from 6.0 to 8.4, mixing speed from 500 to 999.99 rpm, time from 20.00 to 26.94 min, temperature from 28.75 to 37.5 °C, humidity from 0 to 2.31 wt%, and impurities from 0 to 2.99 wt%. Cheng et al. [53] used a GA-based evolutionary SVM (GA-ESVM) to predict and optimize the final acid value of oil in the production of biodiesel from rice bran. They found GA-ESVM better than ANN-GA and SVM. Sivamani et al. [54] used RSM and ANN coupled with GA to predict the yield of Simarouba glauca biodiesel. For higher yield, the optimum values for oil-to-alcohol ratio, temperature, and duration were found to be 1:6.22, 677.25 °C, and 20 h, respectively. Ighose et al. [55] used an ANFIS coupled with RSM and GA to realize a high yield of biodiesel from Thevetia peruviana seed oil via the transesterification process. The ANFIS outperformed the RSM model. In addition, the use of GA resulted in a higher yield than RSM in relatively less time and with less catalyst loading.
Dhingra et al. [56] used ANN and GA to predict and optimize yield in the production of polanga oil-based biodiesel. They combined the ANN, RSM, and GA for predicting the optimized reaction conditions, which resulted in a biodiesel yield of 91.08% by weight, significantly higher than the 78.8% obtained through RSM alone. Ishola et al. [57] used ANN, ANFIS, and GA to estimate and optimize biodiesel yield (methyl esters) from sorrel oil. The ANFIS model outperformed the ANN model, while RSM was the lowest performer in terms of prediction accuracy. In addition, GA also outperformed RSM and obtained the highest biodiesel yield (methyl esters) of 99.71 wt% at a methanol-to-oil molar ratio, catalyst weight, and reaction time of 8:1, 1.23 wt%, and 43 min, respectively. Silitonga et al. [58] used ANN integrated with ant colony optimization to determine the minimum acid value and maximum biodiesel yield from Cerbera manghas. For esterification, the optimum methanol-to-oil molar ratio was 10.5:1, while the best values for reaction time and reaction temperature were 71 min and 54.5 °C, respectively. The optimized catalyst concentration, reaction temperature, methanol-to-oil molar ratio, stirring speed, and reaction time for transesterification were 1.1 wt%, 55 °C, 10.9:1, 1100 rpm, and 72 min, respectively.
Chakraborty et al. [59] used multivariate regression analysis to predict optimum operating conditions for mustard oil (MO)-based biodiesel yield. The optimal values of methanol-to-MO molar ratio, calcination temperature, catalyst concentration, and stirrer speed were 13.13:1, 950 °C, 3.44 wt%, and 890 rpm, respectively. Goharimanesh et al. [60] used multi-objective GA to obtain the optimum reaction temperature for maximizing biodiesel production, amount of ester, and alcohol.
Oladipo et al. [61] used CCD and ANN to maximize FAME content in the production of biodiesel from crude neem, jatropha, and waste cooking oils. They also found that the mesoporous catalyst can be safely reused for up to five cycles. Rajendra et al. [62] used ANN, GA, and central composite rotatable design to optimize the final acid value of oil in the production of biodiesel from jatropha, simarouba, mahua, and rice bran oils. ANN-GA helped in determining optimum process conditions to obtain high yield.
Fahmi and Cremaschi [63] used ANN to predict virgin oil and methanol-based biodiesel yield. An optimization model was used to determine the minimized net present sink in the synthesis of biodiesel. The models for unit operation, thermodynamics, and mixing were replaced by surrogate models to reduce the computational load. Betiku et al. [64] used CCD, ANN, and RSM to determine high biodiesel yield from neem seed oil (NO). The acid value of NO was significantly reduced by one-step esterification. Optimization of the transesterification of pretreated NO using KOH as catalyst resulted in a 99% yield of biodiesel. Zhang and Niu [65] used LS-SVM with GA to estimate and optimize biodiesel yield from castor oil. Based on its high prediction accuracy, the LS-SVM model was recommended for efficient prediction in the process. The optimum values were a catalyst weight of 13 g, MOR of 625, temperature of 4060 °C, and time of 1240 min. Mujtaba et al. [66] used ELM and RSM together with the cuckoo search algorithm to find the best cold flow and lubricity characteristics of biodiesel produced from a palm-sesame oil blend. Bemani et al. [67] developed an LS-SVM model for estimation of the cetane number of biodiesel. The LS-SVM was coupled with GA, PSO, and a hybrid of GA and PSO (HGAPSO) for process optimization.
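The recurring pattern in these studies, a data-driven model used as a surrogate that a GA then searches for optimal process conditions, can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not any cited author's implementation: `surrogate_yield` is a hypothetical quadratic stand-in for a trained ANN, the variable bounds are invented, and the GA uses simple truncation selection, uniform crossover, and Gaussian mutation.

```python
import random

# Hypothetical search ranges: methanol-to-oil molar ratio, catalyst (wt%), temperature (°C).
BOUNDS = [(3.0, 12.0), (0.5, 2.0), (40.0, 70.0)]


def surrogate_yield(x):
    """Hypothetical stand-in for a trained ANN: concave surface peaking at (8, 1.2, 60)."""
    ratio, catalyst, temperature = x
    return (95.0 - 0.8 * (ratio - 8.0) ** 2
            - 20.0 * (catalyst - 1.2) ** 2
            - 0.05 * (temperature - 60.0) ** 2)


def clip(value, low, high):
    return max(low, min(high, value))


def genetic_search(fitness, bounds, pop_size=40, generations=60,
                   mutation_scale=0.1, seed=0):
    """Maximize fitness over box bounds with a small elitist GA."""
    rng = random.Random(seed)
    population = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        elite = population[:pop_size // 2]  # truncation selection keeps the best half
        children = []
        while len(elite) + len(children) < pop_size:
            parent_a, parent_b = rng.sample(elite, 2)
            # Uniform crossover: each gene comes from either parent with equal probability.
            child = [a if rng.random() < 0.5 else b for a, b in zip(parent_a, parent_b)]
            # Gaussian mutation scaled to each variable's range, clipped to the bounds.
            child = [clip(g + rng.gauss(0.0, mutation_scale * (hi - lo)), lo, hi)
                     for g, (lo, hi) in zip(child, bounds)]
            children.append(child)
        population = elite + children
    return max(population, key=fitness)


if __name__ == "__main__":
    best = genetic_search(surrogate_yield, BOUNDS)
    print(best, surrogate_yield(best))
```

In practice, the surrogate would be a model fitted to experimental data, and the GA's result is only as reliable as that model within the sampled operating range.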

Estimation and Optimization of Process Conditions and Efficiency
Karimi et al. [68] used RSM and ANN to estimate FAME content and exergetic efficiency in waste cooking oil (WCO)-based production of biodiesel. The method performed well in achieving high quality and exergetic efficiency of the process by optimizing the input variables. Reaction time, immobilized lipase, concentrations of water, and concentrations of methanol were the design variables. A catalyst concentration of 35%, water content of 12%, methanol-to-WCO molar ratio of 6.7, and reaction time of 20 h achieved FAME content and exergy efficiency of 86% and 80.1%, respectively. Aghbashlo et al. [69] used ANFIS with GA and linear interdependent fuzzy multi-objective optimization to predict Functional Exergy Efficiency (FEE), Normalized Exergy Destruction (NED), Universal Exergy Efficiency (UEE), and Conversion Efficiency (CE) in the production of biodiesel. The optimum values for transesterification temperature, residence time, and methanol-to-oil molar ratio were found to be 60 °C, 10 min, and 6.20, respectively. Patle et al. [70] used multi-objective GA optimization to estimate heat duty, profit, and organic waste in palm waste cooking oil-based biodiesel production. The waste cooking oil flow rate was the factor affecting heat duty, profit, and organic waste. Shukri et al. [71] used ANN to predict in-cylinder pressure (bar), heat release (%), volume generated, and thermal efficiency (%) in palm oil methyl ester blends-based biodiesel production. The 10% biodiesel blend (B10) was found to be more efficient due to its high heating value and cetane number. Sarve et al. [72] used ANN to estimate ethanol-to-oil molar ratio, temperature, reaction time, and initial CO2 pressure in mahua oil-based production of biodiesel. Sensitivity analysis of the ANN model showed that temperature was the most influential variable, followed by reaction time, ethanol-to-oil molar ratio, and initial CO2 pressure. ANN outperformed the RSM model in both data fitting and prediction accuracy. Kuen et al. [73] applied an automatic tuning control scheme integrating Recursive Least Squares (RLS) and Internal Model Control (IMC) to obtain optimized values of the adaptive controller parameters for the biodiesel transesterification reactor. For an introduced disturbance of a 5% rise from nominal values in the reactor temperature and concentration loops, the adaptive controllers responded much faster than conventional PID controllers, i.e., in 370 s and 380 s, respectively.
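The RLS component of such an adaptive scheme can be illustrated with a scalar Python sketch that estimates a single process gain online. This covers only the parameter-estimation step, not the IMC controller design of Kuen et al. [73]; the one-parameter model y = theta * x, the function name, and the default values are assumptions made for illustration.

```python
def rls_gain_estimate(inputs, outputs, forgetting=1.0, theta0=0.0, p0=1000.0):
    """Scalar recursive least squares for the model y = theta * x.

    forgetting: forgetting factor (1.0 weights all samples equally).
    p0: initial covariance; a large value means low confidence in theta0.
    """
    theta, p = theta0, p0
    for x, y in zip(inputs, outputs):
        gain = p * x / (forgetting + x * p * x)   # update gain for this sample
        theta += gain * (y - x * theta)           # correct by the prediction error
        p = (p - gain * x * p) / forgetting       # shrink covariance as data accumulate
    return theta


if __name__ == "__main__":
    # Hypothetical noise-free data generated with a true gain of 2.0.
    xs = [float(i) for i in range(1, 11)]
    ys = [2.0 * x for x in xs]
    print(rls_gain_estimate(xs, ys))
```

Choosing a forgetting factor slightly below 1.0 would let the estimate track slowly drifting process parameters, at the cost of more sensitivity to measurement noise.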
Rouchi et al. [74] used Multivariate Curve Resolution-Alternating Least Squares (MCR-ALS) for interpretation and control of the reaction towards the desired route. For this purpose, the number of components, concentration profiles, spectra, and reaction kinetics were evaluated for soybean oil-based biodiesel with reagents consisting of methanol and NaOH. The correlation coefficient and standard deviation of residuals were 0.99992 and 0.00765, respectively, which showed an advantage of MCR-ALS. Lopez-Zapata et al. [75] used virtual sensors based on the extended Kalman filter to estimate concentrations of triglycerides, monoglycerides, methyl ester, diglycerides, glycerol, and alcohol in jatropha oil-based production of biodiesel. The method has potential for real-time implementation because it needs only a few measured variables, such as temperature and pH. Nicola et al. [76] used multi-objective GA optimization to realize maximum purity of important compounds and minimum energy requirements in the production of vegetable oil-based biodiesel by two processes. The specific energy consumptions for the process schemes were 2.7 MJ/kg and 1.5 MJ/kg, which met the required standards.
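Multi-objective optimization of this kind returns a set of trade-off solutions rather than a single optimum. The Python sketch below shows the underlying Pareto-dominance filter for two objectives, maximizing purity and minimizing specific energy consumption. It is not the algorithm used by Nicola et al. [76]; the candidate values and function names are hypothetical.

```python
def dominates(a, b):
    """True if design a dominates b; each design is (purity, energy).

    Purity is maximized and energy is minimized: a must be no worse in both
    objectives and strictly better in at least one.
    """
    purity_a, energy_a = a
    purity_b, energy_b = b
    no_worse = purity_a >= purity_b and energy_a <= energy_b
    strictly_better = purity_a > purity_b or energy_a < energy_b
    return no_worse and strictly_better


def pareto_front(candidates):
    """Keep the candidates that no other candidate dominates (input order preserved)."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]


if __name__ == "__main__":
    # Hypothetical (purity fraction, energy in MJ/kg) pairs for candidate designs.
    designs = [(0.95, 2.7), (0.97, 3.0), (0.94, 2.9), (0.90, 1.5), (0.90, 1.8)]
    print(pareto_front(designs))
```

A multi-objective GA repeatedly applies this kind of dominance ranking during selection, so the final population approximates the trade-off curve between the objectives.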
Fahmi and Cremaschi [63] used ANN as a surrogate model to identify the superstructure and operating conditions that minimized the net present sink in the production of biodiesel. The surrogate resulted in a less complex model with an efficient representation of the process synthesis. Soltani et al. [36] used ANN to model a nanocrystalline-sized mesoporous sulfonated zinc oxide (SO3H-ZnO) catalyst for the efficient production of biodiesel from palm fatty acid distillate. The prediction error was within an acceptable range of 2.73%. Noriega and Narvaez [77] used Group Interaction Parameters (GIP) to predict Liquid-Liquid Equilibrium (LLE) in the vegetable oil-based production of biodiesel. The most influential variable on LLE was the overall mass fraction, followed by the length of the alcohol chain. Wong and Wong [78] used Extreme Learning Machines (ELM) with Lyapunov analysis to predict the Air-to-Fuel Ratio (AFR) in the production of biodiesel from biofuel blends. The proposed approach effectively regulated the AFR to the desired level. The control strategy outperformed the engine's built-in AFR controller and was highly recommended for dual-injection engines.

Biogas

Quality Estimation
Tufaner et al. [79] used ANN for simulating and optimizing operating conditions of Upflow Anaerobic Sludge Blanket (UASB) reactors for biogas generation. It was observed that ANN can efficiently predict the biogas yield from laboratory-scale UASB reactors. Asadi et al. [80] used ANN and ANFIS with subtractive clustering, Fuzzy C-Means Clustering (FCMC), and grid partition for prediction of the biogas production rate from anaerobic digesters. Based on the results, the ANFIS-FCMC model outperformed the other sets of models. Akkaya et al. [81] used the multiple regression model in the production of biogas from landfill leachates. The proposed method demonstrated sufficient prediction accuracy.

Yield Estimation
Ghatak and Ghatak [82] used ANN to predict the yield of biogas from cattle dung, sugarcane bagasse, bamboo dust, and sawdust under mesophilic and thermophilic conditions. The capability of ANN modeling significantly reduced the processing time required for control of the process. Nair et al. [83] used ANN to evaluate biogas yield in an anaerobic bioreactor from the organic fraction of municipal solid waste consisting of vegetable waste, food waste, and yard trimmings. It was inferred that an optimized CH4 recovery can be reached at a pH range between 6.6 and 7.1 with Total Volatile Solids (TVS) from 77 to 84%. Antwi et al. [84] used different training algorithms for ANN models along with multiple nonlinear regression (MnLR) to estimate biogas and methane yield from chemical, industrial sludges of paper, automobile, petrochemical, and food industries. It was concluded that conjugate gradient backpropagation and the Quasi-Newton method were the best among eleven training algorithms.
Ihunegbo et al. [85] used PLS to predict the yield of biogas from bioslurry. The results indicate that acoustic chemometrics is a reliable Process Analytical Technologies (PAT) approach to monitor Total Solids (TS) in complex bioslurry, and the same concept can be extended to other biomass processing industries as well. Qdais et al. [86] used ANN to predict biogas yield in the production of biogas from a waste digester. The ANN model was effective in capturing the important features of the variables involved in biogas digester operation for methane production.

Optimization of Quality and Yield
Qdais et al. [86] also integrated the ANN model with GA for optimizing operational parameters, which resulted in a 6.9% increase in yield. Dibaba et al. [87] determined that the best performance of the Upflow Anaerobic Contactor (UAC) was 87% COD removal at a hydraulic retention time of 16.67 days, where an increase of 7.4% in biogas production was realized. Barik and Murugan [88] used ANN and GA to estimate and optimize the yield of biogas from cattle dung and seed cake of Karanja in co-digestion. The product quality from co-digestion of the Karanja cake and cattle dung mixture, at a mixing ratio of 1 part Karanja cake to 3 parts cattle dung, was higher than that from cattle dung alone. Oloko-Oba [89] used ANN integrated with GA to predict biogas production from cow dung, poultry droppings, and piggery waste. The optimal amounts of poultry droppings, cow dung, plantain peel, and piggery waste were 0.7 kg, 0.0004 kg, 0.29 kg, and 0.61 kg, respectively. Zareei and Khodaei [90] used ANFIS to estimate and optimize the production of anaerobic digestion-based biogas from cow manure and maize straw. The ANFIS model helped in optimizing the process conditions, which resulted in an 8% rise in production. Kana et al. [91] used ANN and GA to predict the optimum combination of rice bran, paper waste, banana stem, sawdust, and concentration of cow dung that enhanced the yield of biogas. Akbas et al. [92] used an integrated ANN and Particle Swarm Optimization (PSO) model for robust control of a biogas production system. The integrated estimation and optimization framework increased the production and quality of biogas and boosted the quantity of electricity produced at the affiliated wastewater treatment facility.

Biohydrogen
Nasr et al. [93] used the ANN model for the prediction of hydrogen production from different substrates. The initial pH, temperature, initial substrate, biomass concentrations, and time were used as inputs of the model. The ANN exhibited high capability in capturing the correlation between the parameters and the process output. Whiteman and Kana [94] predicted hydrogen yield using ANN and RSM. It was observed that ANN has greater accuracy than RSM. Ren et al. [95] used gray box and ANN models for prediction of biohydrogen yield from feedstocks comprising agricultural residues, paper wastes, and wood chips. The gray box model outperformed the ANN model in predicting the output in the context of uncertain data. Prakasham et al. [96] integrated ANN with GA for the prediction of biohydrogen yield from mixed anaerobic microbial consortia. The optimization strategy resulted in a 16% increase in biohydrogen yield. Aghbashlo et al. [97] used a novel hybrid fuzzy clustering-ranking method with ANN to predict exergetic efficiencies in the production of hydrogen from photo-fermentation. The optimum values of syngas flow rate and agitation speed were 13.68 mL/min and 348.62 rpm, respectively.

Miscellaneous (Bioethanol, Bisabolene)
Ezzatzadegan et al. [98] used Fuzzy Neural Network (FNN) and PSO to predict the yield of bioethanol from corn stover. The optimum fermentation time and required temperature were 69.39 h and 34.5 °C, respectively. Del Rio-Chanona et al. [99] used ANN-based multi-objective optimization with a hybrid stochastic search optimization in bisabolene production from microalgal biofuel.

Consumption, Engine Performance and Emissions
ML applications in consumption, engine performance, and emissions are mostly studied simultaneously in the literature; hence, these aspects are combined in this section. The dominant ML methods in these studies were ANN, ANFIS, Extreme Learning Method (ELM), SVM, and PLS. The studies are predominantly based on biodiesel as a fuel; hence, the classification is carried out based on the ML methods, as shown in Table 5 and discussed in Sections 3.4.1-3.4.4.

Table 5. Summary of ML applications in the consumption and emission phase.

Ismail et al. [100] used ANN to predict CO, CO2, NO, unburned hydrocarbons, maximum Heat Release Rate (HRR), maximum pressure, location of maximum HRR, location of maximum pressure, and cumulative HRR (CuHRR) of an engine using soybean and palm oil-based biodiesel. The ANN model demonstrated high prediction capability for engine combustion and emission behavior. Sharon et al. [101] used ANN to predict hydrocarbon, Brake Thermal Efficiency (BTE), Brake Specific Fuel Consumption (BSFC), NOx, CO, and smoke density of biodiesel produced from vegetable fried oil and non-vegetable fried oil. The prediction accuracy for B15, B30, B60, and B90 was in an acceptable range. Javed et al. [102] used different training structures of ANN to predict BTE, BSFC, O2, CO, NOx, HC, CO2, and EGT of biodiesel produced from jatropha methyl ester. The Levenberg-Marquardt training algorithm with 16 neurons gave the best prediction performance. Canakci et al. [103] used ANN to predict emissions, flow rates, engine load, maximum injection pressure, thermal efficiency, and maximum cylinder gas pressure. The ANN performed well in predicting the outputs, except for emissions such as NOx, CO, and UHC, where the mean prediction error was higher. Oguz et al. [104] used ANN to estimate hourly fuel consumption, power, moment, and specific fuel consumption of biodiesel. The ANN model was found suitable for the prediction of engine performance. Barma et al. [105] used ANN to predict mechanical efficiency, mean effective pressure, Air-to-Fuel Ratio (AFR), fuel consumption, and torque of an engine consuming biodiesel. The back-propagation ANN (BPANN) gave adequate prediction accuracy for the different fuel blends. Celebi et al. [106] used ANN to estimate the noise and vibration level of biodiesel produced from a blend of sunflower, conventional diesel, and canola biodiesel. The ANN model outperformed the Linear Regression (LR) model. Javed et al. [107] used an ANN model to predict the noise of an engine operating on biodiesel with hydrogen dual-fueled zinc-oxide nanoparticle blend. The least noise was found for the H2 flow rate of 1.5 L/min. Aydogan et al. [108] used ANN to estimate the engine performance, engine torque, engine power, Specific Fuel Consumption (SFC), and EGT of an engine using cotton and rapeseed oil biodiesel. The ANN accurately predicted the engine performance, engine torque, SFC, and EGT. Shojaeefard et al. [109] used ANN to estimate the performance, BSFC, brake power, and exhaust emissions of a DI engine working on biodiesel developed from castor oil. The ANN performance was compared with the Group Method of Data Handling (GMDH). The ANN model was better in terms of prediction accuracy, but the GMDH models were superior in terms of simplicity. Sharma et al. [110] used ANN to estimate the performance, BSFC, exhaust temperature, and exhaust composition of an engine using Polanga biodiesel. A highly accurate prediction with a correlation coefficient close to one was achieved. Omidvarborna et al. [111] used ANN to predict NOx emissions and concentration of NOx from EGR engines and non-EGR engines using soybean-based biodiesel. The application of ANN was recommended for the estimation of NOx emissions from both EGR engines and non-EGR engines. Karthickeyan et al. [112] used an ANN model to estimate the performance and emission characteristics of an engine using orange oil-based biodiesel. Orange oil Methyl Ester (OME) with the Variable Compression Ratio (VCR) engine demonstrated higher efficiency and lower fuel consumption. Menon and Krishnasamy [113] used ANN with GA to optimize the emission characteristics and performance of a biodiesel engine. For the optimum biodiesel composition, the total saturated methyl ester content ranged from 36 to 43 wt% and the unsaturated content from 55 to 63 wt%.
Ghobadian et al. [114] used ANN to estimate the fuel consumption, torque, and emissions of an engine working on waste cooking oil-based biodiesel. The correlation coefficient and Mean Squared Error (MSE) for torque, SFC, CO, and HC were close to 1 and 0.0004, respectively. Pai et al. [115] used ANN to estimate the emission characteristics and performance of a variable compression ratio CI-engine working on waste cooking oil-based biodiesel. The mean error values of ANN were less than 8%, which is acceptable. Muralidharan and Vasudevan [116] used ANN to predict the emissions and performance of a single-cylinder, four-stroke variable compression ratio engine using cooking oil-based biodiesel. A good agreement was found between predicted and experimental measurements. Najafi et al. [117] used ANN to predict energy and exergy efficiency, and exhaust temperature in the usage of waste cooking oil-based biodiesel. The ANN was more efficient compared to the first-principle models. Kannan et al. [118] used ANN to predict the performance, torque, power, and specific fuel consumption of a biodiesel engine. The optimum values of injection timing and injection pressure were 25.5° Before Top Dead Center (BTDC) and 280 bar, respectively. Jaliliantabar et al. [119] used ANN to predict the emissions, load, and speed of an engine working on a blend of waste cooking oil-derived biodiesel in diesel. An optimum operation was achieved with reductions in CO, CO2, HC, NOx, and smoke emissions of approximately 47.25%, 48.23%, 52.7%, 94.55%, and 44.29%, respectively. Kurtgoz et al. [120] used ANN to predict biogas engine performance, BSFC, thermal efficiency (TE), and volumetric efficiency (VE) of biogas produced from bovine manure. It was concluded that ANN can accurately estimate TE, BSFC, and VE values.
Aydogan [121] used ANN to predict NOx, SFC, and maximum cylinder inner pressure caused by the usage of various blends of biodiesel, bioethanol, and diesel. ANN exhibited high prediction accuracy with a correlation coefficient of 0.98. Ilangkumaran et al. [122] used ANN to estimate the engine performance, HC, CO, CO2, NOx, BTE, and smoke from an engine working on biodiesel from fish oil. The ANN model exhibited high prediction accuracy. Tosun et al. [123] applied Linear Regression (LR) and ANN to predict torque and exhaust emissions (CO, NOx) of a naturally aspirated diesel engine running on biodiesel-alcohol mixtures. ANN had more accurate results than LR. Dharma et al. [124] used ANN to predict the emission characteristics and performance of a single-cylinder DI-engine using mixed biodiesel-diesel fuel blends. The ANN model was able to accurately predict the outputs for different blends of the fuel. Najafi et al. [125] used ANN with GA to estimate exhaust emissions including NOx, PM, CO, and UHC of a biodiesel blend of glycerol triacetate. With the use of biodiesel and additive, a reduction of NOx and CO emissions of up to 63% and 42%, respectively, was realized. The PM was also substantially reduced, by a factor of 27. Ozgur et al. [126] used ANN to predict CO, CO2, NOx, and NO2 emissions from an engine using soybean oil-based biodiesel. A close agreement was found between the predicted and experimental results.
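
Most of the ANN studies in this subsection are judged by the correlation coefficient and by error measures such as the MSE between predicted and measured values. As a minimal sketch of how these two figures are obtained, the Python snippet below computes both for a pair of short lists. The function names and the sample values are hypothetical and serve only to exercise the formulas; they are not data from the reviewed studies.

```python
import math


def mean_squared_error(measured, predicted):
    # Average of the squared differences between measurement and prediction.
    return sum((m - p) ** 2 for m, p in zip(measured, predicted)) / len(measured)


def pearson_r(measured, predicted):
    # Pearson correlation coefficient between the two series.
    n = len(measured)
    mean_m = sum(measured) / n
    mean_p = sum(predicted) / n
    cov = sum((m - mean_m) * (p - mean_p) for m, p in zip(measured, predicted))
    var_m = sum((m - mean_m) ** 2 for m in measured)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_m * var_p)


# Made-up, normalized torque values (assumption, for illustration only).
measured = [0.20, 0.35, 0.50, 0.65, 0.80]
predicted = [0.22, 0.33, 0.51, 0.66, 0.78]

mse = mean_squared_error(measured, predicted)
r = pearson_r(measured, predicted)
print(mse, r)
```

A correlation coefficient close to 1 shows that predictions follow the trend of the measurements, but it does not rule out a constant offset, which is why the studies typically report an error measure alongside it.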

Neuro Fuzzy Inference System
ZareNezhad and Aminian [127] used ANFIS to predict the surface tension of biodiesel prepared from soybean, rapeseed, palm, and sunflower. The ANFIS-based framework outperformed previously reported work in estimating the surface tension values of ten different biodiesels. A high correlation was found between the model estimated values and the experimental data. Gopalakrishnan et al. [128] used ANFIS and the Dynamic Evolving NFIS (DENFIS) to predict emissions from a transit bus using real-world data of NOx, HC, CO, CO2, and PM of biodiesel. The ANFIS outperformed the DENFIS in the prediction of emissions. Mostafaei et al. [129] used ANFIS models to predict the cetane number of biodiesels from their FAME composition. The ANFIS models developed by Fuzzy C-Means (FCM) and grid partition FIS techniques achieved higher final desirability values of 0.718 and 0.857, respectively. Sakthivel et al. [130] used fuzzy logic and GA to predict the emission, performance, and combustion parameters of a CI-engine working on biodiesel from fish oil. The best blends for high engine performance and reduced emissions were identified. The exact biodiesel proportions for no-load and 25%, 50%, 75%, and 100% loads were determined using the Technique for Order Preference by Similarity to Ideal Solution (TOPSIS) as 17%, 17%, 18%, 17%, and 20%, respectively. Sakthivel [131] applied fuzzy logic to predict BTE, HC, EGT, NOx, smoke, CO, CO2, Combustion Delay (CD), Ignition Delay (ID), and Maximum Rate of Pressure Rise (MRPR) of biodiesel produced from fish oil. The fuzzy approach had an edge over theoretical and empirical methods in terms of prediction accuracy. Debnath et al. [132] used GA to predict BTE, CO, and NOx of an engine running on butanol-based biodiesel blends. A lower biodiesel percentage and a higher butanol percentage had a favorable effect on performance and emissions. A blend with 10% butanol, 10% biodiesel, and 80% diesel resulted in a high heat release rate, cylinder pressure, and BTE, and in reduced NOx. Ardabili et al. [133] used an ANFIS model to estimate the cetane number of biodiesel. It was also found that a rise in the carbon number of FAMEs increases the viscosity, cetane number, and HHV. However, a rise in the number of double bonds causes a decrease in viscosity, cetane number, and HHV.
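
TOPSIS, which Sakthivel et al. [130] used to select blend proportions, ranks alternatives by their relative closeness to an ideal solution across several criteria. The Python sketch below shows the standard procedure with vector normalization. The blend names, criterion values, and weights are hypothetical and are not taken from [130]; the function assumes that the alternatives are not all identical.

```python
import math


def topsis(matrix, weights, benefit):
    """Return closeness scores in [0, 1]; higher means closer to the ideal.

    matrix:  one row per alternative, one column per criterion
    weights: importance of each criterion
    benefit: True if larger is better for that criterion, False if smaller is
    """
    n_crit = len(weights)
    # Vector-normalize each criterion column, then apply the weights.
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) for j in range(n_crit)]
    weighted = [
        [weights[j] * row[j] / norms[j] for j in range(n_crit)] for row in matrix
    ]
    # Ideal and anti-ideal points depend on the direction of each criterion.
    columns = list(zip(*weighted))
    ideal = [max(c) if benefit[j] else min(c) for j, c in enumerate(columns)]
    worst = [min(c) if benefit[j] else max(c) for j, c in enumerate(columns)]
    scores = []
    for row in weighted:
        d_best = math.dist(row, ideal)
        d_worst = math.dist(row, worst)
        scores.append(d_worst / (d_best + d_worst))
    return scores


# Hypothetical blends with made-up values: BTE (%), NOx (ppm), smoke (%).
blends = ["B10", "B20", "B30"]
data = [[30.0, 600.0, 40.0], [31.0, 620.0, 35.0], [29.0, 680.0, 33.0]]
scores = topsis(data, weights=[0.4, 0.3, 0.3], benefit=[True, False, False])
ranking = sorted(zip(blends, scores), key=lambda pair: pair[1], reverse=True)
print(ranking)
```

The ranking depends on the chosen weights, so a study applying TOPSIS needs to state how the relative importance of performance and emission criteria was set.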

Extreme Learning Method
Silitonga et al. [134] used ELM to predict the BSFC and thermal efficiency of an engine running on biodiesel and bioethanol blends. The biodiesel-bioethanol-diesel blends had an oxidation stability of more than 20 h, which showed their potential for commercialization. Silitonga et al. [135] used a Kernel-based ELM (K-ELM) to predict the exhaust emissions, performance, and combustion characteristics of biodiesel. K-ELM demonstrated high potential for application in the prediction and process optimization of biodiesel derived from a variety of feedstocks. Wong et al. [136] used K-ELM to predict the fuel consumption and emission characteristics of an engine working on biodiesel. With the K-ELM, Cuckoo Search (CS) was then used to determine the optimal biodiesel ratio. The CS optimization was compared with experimental results and PSO. The computational times of CS and PSO were similar. However, PSO required more time for parameter tuning than CS because CS had fewer user-defined parameters. Wong and Wong [137] used Bayesian ELM (B-ELM) and metaheuristic optimization to predict fuel consumption, the k-value, and emission characteristics of an engine using gasoline and ethanol. The metaheuristic optimization methods were also applied to identify the optimal ECU setup.
Wong et al. [138] used ELM, LS-SVM, and RB-FNN to predict performance and the concentrations of NOx, CO, HC, CO2, and PM in emissions from an engine working on biodiesel. The ELM performed better than LS-SVM and RB-FNN. Wong et al. [139] compared the prediction accuracy of SB-ELM with conventional ELM, B-ELM, and ANN to estimate performance parameters of the engine, fuel consumption, and the concentrations of CO, HC, and CO2 emissions. SB-ELM outperformed the other methods.
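
For orientation, the relations below summarize the commonly cited formulation of ELM and its kernel variant: the hidden-layer parameters are drawn at random and left untrained, and only the output weights are obtained from a regularized least-squares fit. The symbols (N samples, L hidden neurons, activation g, regularization constant C, target matrix T, kernel K) are notation introduced here for illustration; the variants used in the cited studies may differ in detail.

```latex
% Hidden-layer output matrix with random, untrained weights w_j and biases b_j
H_{ij} = g\!\left(\mathbf{w}_j^{\top}\mathbf{x}_i + b_j\right),
\qquad i = 1,\dots,N,\quad j = 1,\dots,L

% Output weights from a regularized least-squares fit to the targets T
\hat{\boldsymbol{\beta}} = \left(H^{\top}H + \tfrac{1}{C}\,I\right)^{-1} H^{\top} T

% Kernel variant (K-ELM): the hidden layer is replaced by a kernel matrix \Omega
\Omega_{ik} = K(\mathbf{x}_i, \mathbf{x}_k),
\qquad
f(\mathbf{x}) = \left[K(\mathbf{x},\mathbf{x}_1),\dots,K(\mathbf{x},\mathbf{x}_N)\right]
\left(\tfrac{1}{C}\,I + \Omega\right)^{-1} T
```

Because training reduces to a single linear solve rather than iterative back-propagation, ELM models are fast to fit, which is consistent with their use alongside metaheuristic optimizers such as CS and PSO in the studies above.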

Support Vector Machine and Least Square Methods
Alves and Poppi [140] used SVM and PLS to estimate the biodiesel content in a fuel blend. A comparison of the SVM and PLS models showed that SVM outperformed PLS in terms of prediction accuracy. Maheshwari et al. [141] used nonlinear regression to estimate the performance and emission characteristics, smoke, BTE, HC, and NOx emissions of biodiesel from Karanja. A 3% biodiesel-diesel blend was found to be optimal for emissions and efficiency. Shamshirband et al. [142] used SVM Wavelet Transform (SVM-WT), SVM-RBF, SVM Firefly Algorithm (SVM-FFA), SVM based on quantum particle swarm optimization (SVM-QPSO), and ANN to predict the exergetic parameters of a diesel engine and exhaust hot gas, the exergy transfer rate to the cooling water, the fuel exergy rate, and the sustainability index of waste oil-based biodiesel. The SVM-WT approach was more efficient in the prediction of exergetic efficiency and the identification of the best combustion properties and fuel composition.

Application Summary
In the soil stage, ensemble and SVM were the most commonly used ML methods. The input variables for the ML application were soil characteristics, average precipitation, temperature, solar radiation, and wind speed. The output variables for the ML application were biomass yield and future life cycle environmental impacts. In the feedstock stage, ANN, statistical regression model, multiple linear regression, RF, SVR, and multiple nonlinear regression were used. The input variables for the ML application were blend composition, mixing speed, mixing time, heating values, temperature, size dimensions of microalgae, and characteristics of fluorescence signals. The dominant output variables were viscosity, density, flash point, higher heating values, oxidation stability, cetane number, yield and characteristics of biocrude and hydrochar, and fraction of methane.
The production phase was divided according to the type of biofuel produced, namely biodiesel, biogas, biohydrogen, and miscellaneous. The biodiesel category was further divided into four categories based on the nature of the application: (1) quality estimation, (2) yield estimation, (3) quality and yield optimization, and (4) estimation and optimization of process conditions and efficiency.
For the quality estimation, the dominant ML method was ANN followed by the regression model. The top five most commonly used input variables were reaction time, reaction temperature, calcination temperature, flow rate, and pressure. In the yield estimation, the dominant ML method was ANN followed by ANFIS. The top five most commonly used input variables were catalyst concentration, reaction time, temperature, methanol-to-oil molar ratio, and total volatile fatty acid (VFA) of the effluent.
In the quality and yield optimization section, GA-based ANN was the dominant ML method followed by GA-based SVM and ANFIS. The top five most commonly used input variables were methanol-to-oil molar ratio, reaction temperature, stirring speed, reaction time, and catalyst concentration. In the process efficiency estimation and optimization section, ANN was the dominant ML method followed by ANFIS in combination with various optimization methods. The top five most commonly used input variables were concentration, water content, reaction time, temperature, and methanol-to-oil molar ratio.
The biogas category was divided based on the nature of the applications: (1) quality estimation, (2) yield estimation, and (3) optimization of quality and yield. In the quality estimation, ANN was the dominant ML method followed by ANFIS and multiple regression models. The top five most commonly used input variables were volatile fatty acids (VFAs), total solids, fixed solids, volatile solids, and pH. In the yield estimation, ANN was the dominant ML method followed by MNLR models and PLS. The top five most commonly used input variables were temperature, pH, TVS, VFAs, and composition. In the quality and yield optimization section, ANN-GA was the dominant ML method followed by ANFIS. The top five most commonly used input variables were TS, TVS, pH, temperature, and carbon-to-nitrogen ratio.
For the biohydrogen, ANN and its integration with GA were used for yield prediction and optimization. The input variables were pH, substrate, biomass concentrations, temperature, and time. The output variables were biohydrogen yield, exergetic outputs, and COD removal. In the miscellaneous category, comprised of bisabolene and bioethanol, ANN, FNN and their integration with optimization techniques such as PSO were used. The input variables were incident light intensity, recycling gas flow rate, cardinal coordinates of sample, temperature, glucose content, and fermentation time.
The consumption, engine performance, and emission cases were studied simultaneously in most of the related papers; hence, they were reviewed in a single subsection. Biodiesel was the dominant type of biofuel in this stage, so the literature was instead classified by the type of ML method, namely ANN, ANFIS, ELM, and SVM.
In the ANN application, the top five input variables were biofuel blend, engine speed, load, cetane number, and output torque. The top five output variables were NOx, CO, and CO2 emissions, BSFC, and temperature. In the ANFIS application, the top five input variables were double bonds, blend, load, average carbon numbers, and temperature. The top five output variables were BTE, NOx, CO, smoke, and CO2. In the ELM application, the top five input variables were biodiesel ratio, engine speed, engine torque, fuel injection time, and idle air valve normal position. The top five output variables were fuel consumption, brake thermal efficiency, performance, exhaust emissions, and fuel concentrations. In the SVM and LS methods, the input variable was composition and the output variable was the yield.
The overall trends observed in the ML application in the biofuels' life cycle are shown in Figure 2. The phase-wise application of ML is shown in Figure 2a. The number of publications in the subject area has been consistently increasing (except for the years 2014 and 2018), as shown in Figure 2b. The number of publications by ML method is shown in Figure 2c, where ANN was reported in 72 publications, followed by GA in 20, SVM in 15, and ELM in 14. The emergence of the ELM, which belongs to the second generation of ML, is shown in Figure 2d. Studies in the subject area have been contributed from across the globe, as shown in Figure 3. The leading country was India with 25 publications, followed by Iran, Turkey, Malaysia, and the USA with 21, 14, 13, and 12 publications, respectively.

Conclusions and Future Work
Of the four life cycle stages of biofuels, the production stage holds a 52% share of the reported ML applications, followed by the consumption and emission, soil, and feedstock stages with shares of 35%, 9%, and 4%, respectively. ANN has been consistently dominant since the first generation of ML methods. Interestingly, GA-based optimization was the second most reported method after ANN. GA outperformed the conventionally used RSM approach in realizing optimum process conditions when the two were compared in several studies. The ML methods in descending order of application are ANN, GA, SVM, ELM, ANFIS, regression (linear/non-linear), ensemble learning, LS, and PCA. The application of second-generation ML methods was reported for the first time in 2013. ELM was the dominant second-generation ML method, with several variations reported every year after 2013. Studies in the subject area were contributed from across the globe; however, India, Iran, Turkey, Malaysia, and the USA collectively account for 54% of the authors' affiliations in the reported articles.
Although applications of ML in biofuels are found in the whole life cycle, there is no research applying ML to cover integrated supply chains in biofuels involving agriculture production, feedstock management, quality control, bioprocessing to transform biomass into biofuels and the consumption and related emissions of the biofuels altogether. The challenge is to better determine the interplay over the decision-making among multiple players of multi-commodities of the value chain of biofuels.
Decisions on the complex design, operation, and control of today's industry may rely on the novel capabilities of advanced analytics in engineering. A potential future application of advanced analytics is to model the Supply Chain Resilience (SCR) of transactions, logistics, operations, etc., for such complex representations of supply chain elements in the industry. Subsequently, ML approaches can be used to determine optimizable surrogate models that correlate independent variables (e.g., resistance and recovery of the SCR) with the dependent variable SCR.
An ML methodology to quantify SCR based on continuous x and binary y variables of resistance (avoidance and containment) and recovery (stabilization and return) can consider ad hoc relationships of dependent and independent variables to be part of the SCR predictions in the ML method. Such SCR algebraic or analytical formulas, obtained from a constrained decision regression approach, can be used in optimization and control problems to move away from traditional independent networks and create more flexibility in fulfilling demand through the complementary behavior of heterogeneous resources. Such coupling of multi-layered networks paves the way for optimal resource exchange, efficient decision making, and knowledge discovery through the development of ML, control, and optimization techniques for large-scale interdependent decision making.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: