Application and comparison of MLR, ANN and CART models for predicting PM10 concentration level of Guwahati city (India)

Indian cities are becoming increasingly susceptible to PM10-induced health effects, which has become a matter of concern for the country's policymakers. Air pollution is engulfing comparatively smaller cities as well, as the rapid pace of urbanization and economic development shows no sign of slowing. A review of air pollution in 28 Indian cities, covering tier-I, II, and III cities, found that they grossly violated both the WHO and NAAQS standards for acceptable daily average PM10 concentrations by a wide margin. Predicting city-level PM10 concentrations in advance and initiating prior actions accordingly is an acceptable solution to protect city dwellers from PM10-induced health hazards. The predictive ability of three models, linear MLR, nonlinear MLP (ANN), and nonlinear CART, for one-day-ahead PM10 concentration forecasting in the tier-II city of Guwahati was tested with 2016-2018 daily average observed climate data, PM10, and gaseous pollutants. The results show that the nonlinear MLP algorithm, with the feedforward backpropagation network topology of the ANN class, gives the best predictions when compared with the linear MLR and nonlinear CART models. The ANN (MLP) approach may therefore be useful for deriving a predictive understanding of the one-day-ahead PM10 concentration level and thus provide policymakers with a tool for improving decision-making associated with air pollution and public health.


Introduction
Over the past years, airborne particulate matter (PM) concentrations in Indian cities have been rising and have become a matter of concern for policymakers in India. The Central Pollution Control Board (CPCB) of India has now included forty-one comparatively smaller tier-II cities in its list of severely air-polluted cities [1]. The World Health Organization (WHO) has been very critical of the increasing air pollution in Indian cities in general and of its adverse health effects on urban populations [2]. The total pollution mortality counts for India are the highest in the world [3]. Apte et al. [4] estimated that the level of air pollution in India would need to decline by 20% by 2030 to maintain per-person mortality at the 2010 level. Several studies in the Indian context and elsewhere have established the relationship between high levels of PM10 (particulate matter with an effective aerodynamic diameter smaller than 10 µm) and an increase in hospital admissions for lung and heart diseases [5,6,7,8]. The sources of air pollution in Indian cities are varied. They include vehicle exhaust, small-scale as well as heavy industries, power generation, brick kilns, dust resuspended on roads by vehicle movement, construction-related activities, open waste burning, combustion of various fuels, and in-situ power generation using diesel generator sets [9]. degree of health risk for the city dwellers [26]. Third, we tested the predictive ability of three models, based on linear MLR, nonlinear ANN (MLP), and nonlinear CART (Classification and Regression Trees) analyses, for one-day-ahead PM10 concentration forecasting in Guwahati city. These models were critically assessed through a comparative evaluation of performance indicators, with the end goal of choosing the best-fitted model for accurately forecasting PM10 at the city level.
Although a higher mortality burden associated with air pollution exposure has been reported for Indian tier-II cities, few studies on PM10 forecasting have been performed for them to date. Policymakers would be better placed to take policy decisions if the PM10 concentration level of a city like Guwahati were known one day ahead with precision, using the best of the readily available statistical approaches. The novelty of this study lies in testing the predictive ability of the three models in the context of the tier-II cities of India. Unlike previous modeling efforts (Table 1), this is the first application of CART analysis as a statistical procedure for predicting PM10 in a comparative setup for an Indian city. In the recent past, Gocheva-Ilieva and Stoimenova [27] employed CART to predict PM10 for the city of Pleven in Bulgaria and reported very accurate model performance. The CART technique is rapidly evolving as a new method for the analysis and forecasting of PM10 [28].

Location of the study
The forecasting models for PM10 were developed for Guwahati, a tier-II city in north-eastern India and the capital city of the state of Assam. The city, with a population of about one million, covers 340 km² of urban space contiguous with a large river (the Brahmaputra) flowing through it. It has a scenic landscape, with the Brahmaputra River on one side and the foothills of the Shillong plateau on the other. Vehicular growth (both light and heavy vehicles) in the city has been notable over the past decade, with a reported rise of about 87%. Guwahati has a humid subtropical climate. The four major seasons of the city are winter (December to February), spring (March to May), summer (June to August), and autumn (September to November), each with differing meteorological conditions. Guwahati has been recognized as one of India's most rapidly growing cities for the last 10-12 years. There is black carbon pollution in the city air due to rapid urbanization and poor environmental quality control [29]. Guwahati has six ambient air monitoring stations, set up under the National Air Quality Monitoring Programme (NAMP), to measure key pollutants [30]. One of the NAMP stations can measure PM2.5, while the newly developed CAAQM (Continuous Ambient Air Quality Monitoring) station started functioning only in mid-2019. The locations covered for collecting data are shown in Table 2 and Figure 1 below.

Data Treatment
A few missing values were observed in the daily average concentration data for PM10, CO, NO2, and SO2 in the 2016-2018 time series. Linear interpolation (Equation 1) was used to impute the missing values:
$$ f(x) = f(x_0) + \frac{f(x_1) - f(x_0)}{x_1 - x_0}\,(x - x_0) \tag{1} $$
where x = value of the independent variable, x_1 and x_0 = known values of the independent variable, and f(x) = value of the dependent variable for the value x of the independent variable. There were no missing values in the climate data (1096 data points).
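The imputation step can be illustrated with a minimal Python sketch. It assumes a daily series stored as a list with `None` marking missing days; the function name and the sample values are hypothetical and are not taken from the study's data.

```python
def interpolate_missing(values):
    """Fill interior gaps (None) by linear interpolation between known neighbours.

    Gaps at the very start or end of the series have only one known
    neighbour and are left as None.
    """
    filled = list(values)
    known = [i for i, v in enumerate(filled) if v is not None]
    for left, right in zip(known, known[1:]):
        span = right - left
        for i in range(left + 1, right):
            # f(x) = f(x0) + (f(x1) - f(x0)) / (x1 - x0) * (x - x0)
            slope = (filled[right] - filled[left]) / span
            filled[i] = filled[left] + slope * (i - left)
    return filled


# Hypothetical daily PM10 values with a two-day gap.
pm10 = [80.0, None, None, 110.0, 95.0]
print(interpolate_missing(pm10))
```

Here the day index plays the role of the independent variable x, and the pollutant concentration plays the role of f(x).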

Assessment of PM10 pollution across Indian cities
Approximately 120 English-language articles published between 2011 and 2019 on the monitoring and control of ambient particulate matter (PM10 and PM2.5) in Indian cities were reviewed. We included journal papers found through computer searches of bibliographic databases (e.g., Google Scholar, PubMed, Academia, ResearchGate). We used keywords such as PM10, PM2.5, air pollution, and Indian city to search for the relevant literature. The main aim of the review was to assess the PM10 status of various Indian cities in order to understand the severity of the air pollution problem across them. The severity of the problem in turn establishes the need to correctly predict PM10 concentrations at least one or two days in advance, so that ill effects can be mitigated by minimizing exposure.

Descriptive statistics and analysis of time series
Descriptive statistics of the climate data, PM10, and gaseous pollutants for the period 2016-2018 (1096 data points) and a time series analysis were also worked out for air quality monitoring station 6 to understand the characteristics and correlation of the different variables over the period of the study. Station 6 was found to be representative of the six air quality monitoring stations of the city because of the completeness of its data sets and its reflection of the common land-use patterns of the city. Multiple time series charts were produced with time on the horizontal axis and PM10 concentrations together with the climate and gaseous variables (RH, SO2, CO, NO2) on the vertical axes.

Predictive models development and validation
We have used MLR analysis, MLP class of ANN, and CART for forecasting of one day ahead PM10 concentration for all the six air quality monitoring stations of Guwahati city.
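A one-day-ahead setup of this kind pairs each day's observations with the following day's PM10 value. The sketch below shows one plausible way to arrange the data, assuming daily records held as dictionaries; the function name, field names, and values are hypothetical and are not the study's actual data layout.

```python
def make_one_day_ahead(records, target="PM10"):
    """Pair the inputs observed on day t with the target observed on day t + 1."""
    inputs = records[:-1]                        # days 0 .. n-2
    targets = [r[target] for r in records[1:]]   # days 1 .. n-1
    return inputs, targets


# Hypothetical daily records for one monitoring station.
days = [
    {"PM10": 100.0, "RH": 80.0, "NO2": 14.0},
    {"PM10": 120.0, "RH": 75.0, "NO2": 16.0},
    {"PM10": 90.0, "RH": 85.0, "NO2": 12.0},
]
X, y = make_one_day_ahead(days)
print(len(X), y)
```

The last day has no following observation, so a series of n days yields n - 1 training pairs.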

Multiple Linear Regression (MLR)
In MLR analysis, a mathematical model was built to forecast the dependent variable, i.e. next-day PM10, from independent variables comprising climate variables and gaseous pollutants. In MLR, the coefficient of determination (R²) indicates the overall capability of the model to explain the variance in the data. The regression model follows Equation 2 [42,43]:
$$ Y = \beta_0 + \sum_{i=1}^{n} \beta_i X_i + \varepsilon \tag{2} $$
where Y is the dependent variable, β_i are the regression coefficients, X_i are the independent variables, and ε is the stochastic error associated with the regression. This relationship was used in this study to develop a mathematical model to predict the next-day PM10 concentrations at the six ambient air monitoring stations of Guwahati, with meteorological parameters, PM10, and gaseous pollutants as input variables. MLR assumes that the residuals are normally distributed with zero mean, are uncorrelated, and have constant variance. The stepwise multiple linear regression procedure was used here to derive the mathematical equation [43]. The variance inflation factor (VIF) was used in this study to evaluate the effect of multicollinearity on the variance of the estimated regression coefficients. The equation for VIF (Equation 3) is as follows:
$$ \mathrm{VIF}_i = \frac{1}{1 - R_i^2} \tag{3} $$
where R_i² is the coefficient of determination obtained by regressing the i-th independent variable on the remaining independent variables.
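To make the regression and the VIF check concrete, the following sketch fits an ordinary least-squares model through the normal equations and computes VIF as 1 / (1 - R²), where R² comes from regressing one predictor on the others. It is a plain-Python illustration with hypothetical data, not the stepwise SPSS procedure used in the study.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def ols(X, y):
    """Return [b0, b1, ...] for y = b0 + sum(b_i * x_i), via the normal equations."""
    rows = [[1.0] + list(r) for r in X]
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    return solve(xtx, xty)


def r_squared(X, y):
    """Coefficient of determination of the least-squares fit of y on X."""
    beta = ols(X, y)
    pred = [beta[0] + sum(b * v for b, v in zip(beta[1:], r)) for r in X]
    mean_y = sum(y) / len(y)
    ss_res = sum((a - p) ** 2 for a, p in zip(y, pred))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot


def vif(X, j):
    """VIF of predictor j: regress column j on the remaining columns."""
    target = [r[j] for r in X]
    others = [[v for k, v in enumerate(r) if k != j] for r in X]
    return 1.0 / (1.0 - r_squared(others, target))


# Hypothetical predictors (two columns) and response.
X = [[1.0, 1.0], [2.0, -1.0], [3.0, -1.0], [4.0, 1.0]]
y = [5.5, 7.5, 10.5, 14.5]
print(ols(X, y), vif(X, 0))
```

A VIF close to 1 means the predictor is nearly uncorrelated with the others; the study treats values below 10 as acceptable.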

Multi-Layer Perceptron (MLP) Model
ANN is considered to be a robust data modeling technique capable of handling nonlinear relationships between variables, and is hence suitable for the prediction of PM10, which requires exploring the complex relationship between particulate matter, meteorological variables, and gaseous pollutants present in the atmosphere [44]. The MLP architecture of ANN with feedforward backpropagation (FFBP) topologies is the most common and effective of the several models available. We have used MLP in this study to create predictive models for each of the six ambient monitoring stations of Guwahati, using nonlinear combinations of the input variables (meteorological parameters, PM10, PM2.5, and gaseous pollutants) to predict the next-day PM10 concentrations. MLP forms a network of functionally interconnected neurons, also known as perceptrons [45]. ANN is often preferred over MLR because of its ability to predict the dependent variable of a model more accurately [46]. MLP has a simple structure consisting of three layers: the input layer, the hidden layer, and the output layer. One hidden layer was considered in our study, as it has been suggested to be sufficient to achieve the optimum model capacity [47]. The number of neurons (nodes) in the input layer was equal to the number of input variables introduced in the model. The relevant input variables, i.e. observed meteorological parameters, PM10, and gaseous pollutants, are fed as signals to the input layer of the model and then passed on to the hidden layer. The neurons perform the computations needed to detect features of the input variables and pass them on with the requisite weights. The weights are assigned to the input variables based on their relative importance. The hidden layer performs the critical function of nonlinearly transforming the inputs entered into the network through a predefined activation function. Each neuron in the hidden layer sums up the incoming information, including the bias.
The bias provides a trainable constant value to every neuron in addition to its weighted inputs. The mathematical formulation of the MLP model (Equation 4) is as shown below:
$$ Y = F\left( \sum_{j=1}^{m} W_{kj}\, F\left( \sum_{i=1}^{n} W_{ji} X_i + B_j \right) + B_k \right) \tag{4} $$
where Y = output, F = transfer function, W_kj = weights between the hidden and output layers, W_ji = weights between the input and hidden layers, X_i = input variables, m = number of neurons in the hidden layer, n = number of neurons in the input layer, B_j = bias values of the neurons in the hidden layer, and B_k = bias values of the neurons in the output layer.
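The forward computation of a single-hidden-layer MLP with one output can be sketched as follows. The weights, biases, and the choice of a hyperbolic tangent hidden activation with a linear output are illustrative assumptions only; they are not the trained values or the final configuration of the study, and the parameter names are hypothetical.

```python
import math


def mlp_forward(x, w_ji, b_j, w_kj, b_k, hidden=math.tanh, output=lambda z: z):
    """Forward pass of an MLP with one hidden layer and a single output neuron.

    w_ji: one row of input weights per hidden neuron
    b_j:  one bias per hidden neuron
    w_kj: one weight per hidden neuron, feeding the output neuron
    b_k:  bias of the output neuron
    """
    hidden_out = [
        hidden(sum(w * xi for w, xi in zip(row, x)) + bias)
        for row, bias in zip(w_ji, b_j)
    ]
    return output(sum(w * h for w, h in zip(w_kj, hidden_out)) + b_k)


# Hypothetical normalized inputs and weights: 2 inputs, 2 hidden neurons.
x = [0.4, 0.7]
w_ji = [[0.5, -0.2], [0.1, 0.3]]
b_j = [0.0, 0.1]
w_kj = [1.2, -0.8]
b_k = 0.05
print(mlp_forward(x, w_ji, b_j, w_kj, b_k))
```

Training by backpropagation would adjust `w_ji`, `b_j`, `w_kj`, and `b_k` to reduce the prediction error; only the forward evaluation is shown here.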

Classification and Regression Trees (CART)
CART is a non-parametric regression technique that can be employed to predict a dependent variable when the distribution of the independent variables is not known. Typically, therefore, the CART method tries to ascertain the distribution pattern of the outcome (dependent) variable using the independent variables through their linear or nonlinear relationship with the outcome variable. CART builds up a decision tree through a hierarchy of binary decisions. Each binary decision splits the observations of the target variable into two mutually exclusive branches (groups) according to the values of an explanatory variable, choosing the split that leads to the largest possible reduction in the post-split variation of the target variable. Splitting stops when no additional gain can be achieved by further splitting [48,49]. CART allows easy visualization of the process until the terminal node is reached and can handle any type of variable (numeric, binary, categorical, etc.). In this study, we built predictive CART models for all six air quality monitoring stations of the city, with meteorological parameters, PM10, and gaseous pollutants as the independent variables and next-day PM10 concentration as the dependent variable.
SPSS 25 was used for the MLR and MLP computations, while SPSS Modeler 18 was used for CART.
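The binary splitting idea behind a regression tree can be illustrated with a short sketch that searches a single numeric predictor for the threshold giving the lowest total within-branch sum of squared errors. This is a simplified illustration with hypothetical data, not the SPSS Modeler implementation; a full CART would repeat the search over all predictors and recurse on each branch.

```python
def sse(values):
    """Sum of squared deviations of the values around their mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(x, y):
    """Return (threshold, post-split SSE) of the best binary split on x.

    Returns None when all x values are identical and no split is possible.
    """
    pairs = sorted(zip(x, y))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # cannot split between equal predictor values
        left = [target for _, target in pairs[:i]]
        right = [target for _, target in pairs[i:]]
        cost = sse(left) + sse(right)
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        if best is None or cost < best[1]:
            best = (threshold, cost)
    return best


# Hypothetical predictor (e.g. wind speed) and next-day PM10 values.
wind_speed = [1.0, 2.0, 3.0, 4.0]
next_day_pm10 = [10.0, 10.0, 20.0, 20.0]
print(best_split(wind_speed, next_day_pm10))
```

The reduction in variation achieved by a split is the parent node's SSE minus the post-split SSE returned here.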

PM10 across Indian cities: A review
Improving air quality is far from easy for a country like India, as its policymakers cannot forgo the objective of faster economic development needed to sustain its vast population. Ambient particulate matter (PM2.5 and PM10) is increasingly contributed by many diverse sources, among which the transportation sector is one of the most important. The rapid pace of urbanization, increasing affluence, and the inefficiency of the public transportation system have jointly contributed to the addition of a vast number of new automobiles in the cities every year. The number of automobiles registered in India had already reached 0.21 billion in 2015, up from a mere 58.92 million in 2002. Considering India as a whole, about 53,929 automobiles hit the roads every day [53]. All of this has made the current state of air pollution in India among the more notable in the world. Table 4 below presents a summary of the output of a number of studies (quantitative and qualitative) conducted by different researchers in the context of 28 Indian cities and their reported levels of PM10 concentration. Amongst the 28 cities considered here, 7 are tier-I, 13 tier-II, and 8 tier-III Indian cities. The contribution of specific sources can vary significantly across the cities, and the meteorological conditions prevailing in the respective cities also play an important role in building up ambient PM10 concentrations. It can be seen from Table 4 that Kolkata, a tier-I city in India, even recorded a PM10 concentration as high as 445 ± 210 μg m⁻³ during the wintertime [54]. Annual PM10 concentrations in New Delhi were reported to be 222 ± 142 μg m⁻³, while an earlier study reported summer and winter mean concentrations of 95.1 ± 22.2 μg m⁻³ and 182 ± 32.5 μg m⁻³, respectively [55,56].
Bengaluru also registered a high annual mean PM10 concentration. The tier-II cities are also not lagging far behind the tier-I cities of India in terms of PM10 pollution. It can be seen from Table 4 that the PM10 concentrations of the 28 Indian cities also grossly violated the air quality standard values of 20 μg m⁻³ and 60 μg m⁻³ set by the WHO and the NAAQS of India, respectively. The ratios of the PM10 concentrations to the NAAQS and WHO standards in the Indian cities considered in this study can be seen in Figure 1.
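The exceedance ratios are simple quotients of the reported concentration and the respective standard. As a worked check, the sketch below uses the standards of 20 μg m⁻³ (WHO) and 60 μg m⁻³ (NAAQS) together with the Kolkata winter mean of 445 μg m⁻³ and the New Delhi annual mean of 222 μg m⁻³ reported above; the function and constant names are hypothetical.

```python
WHO_STANDARD = 20.0    # μg m⁻³, as quoted in the text
NAAQS_STANDARD = 60.0  # μg m⁻³, as quoted in the text


def exceedance(pm10):
    """Return (times WHO standard, times NAAQS standard), rounded to 2 decimals."""
    return round(pm10 / WHO_STANDARD, 2), round(pm10 / NAAQS_STANDARD, 2)


print("Kolkata:", exceedance(445.0))
print("New Delhi:", exceedance(222.0))
```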

Descriptive statistics
The mean values and standard deviations of the meteorological parameters, PM10, and gaseous pollutants at the respective air quality monitoring stations of the city are provided in Table 5. High variability was observed in the daily average PM10 level during 2016-18. Across the six air quality monitoring stations, the maximum and minimum mean PM10 concentrations were 133.32 μg m⁻³ and 51.41 μg m⁻³, respectively. The highest daily average PM10 recorded during 2016-18 was 259.39 μg m⁻³, while the lowest was 40.67 μg m⁻³. The average RH level of the city was found to be on the higher side, while wind speed was on the lower side. The time-series data show a maximum temperature of 34 °C recorded during the summer season and a minimum of 14 °C during the winter season. Guwahati receives rainfall from the southwest monsoon, and the highest rainfall occurred from June to August.

Correlation of PM10 concentration, climate variables, and gaseous variables
In Figures 4(A)-4(D), the time series of the observed meteorological parameters, PM10, and gaseous pollutants are reported for air quality monitoring station 6 of Guwahati city. It can be observed from Figure 4(A) that the site is characterized by relatively high humidity throughout the year. The time series considered in this study shows that the PM10 concentration maintained a predominantly negative correlation with relative humidity. The PM10 concentration of the city follows an annual cycle, with high concentrations during winter (December to February), possibly due to a lower planetary boundary layer height; the higher concentrations seem to continue into March-April as well, i.e. beyond winter.
Another peculiarity of the site is that both CO and SO2 correlate with PM10 concentrations, suggesting a common source for these compounds, although the correlation with SO2 is stronger, as shown in Figures 4(B) and 4(C). PM10 also shows an almost linear correlation with NO2, as evidenced in Figure 4(D) below.

Multiple Linear Regression Model for PM10 forecasting
A summary of the MLR models developed for all six ambient air quality monitoring stations located in Guwahati is given in Table 6. The Variance Inflation Factor (VIF) values for the independent variables of all six MLR models are in order, as they are below 10, indicating that the models do not suffer from multicollinearity issues. The Durbin-Watson (D-W) statistics indicate no serious autocorrelation in the residuals, as the values were in the range 2.103-2.239, i.e. close to 2. The residual (error) is critical in judging the robustness of the fitted model, as linear regression is sensitive to outlier effects. Figures 5A-5F show the histogram plots, which indicate that the residuals are also normally distributed with zero mean and constant variance.
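The Durbin-Watson statistic referred to above is the ratio of the sum of squared successive differences of the residuals to the sum of squared residuals. It ranges from 0 to 4, and values near 2 indicate little first-order autocorrelation. A minimal sketch with hypothetical residuals:

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic: sum((e_t - e_{t-1})^2) / sum(e_t^2)."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Hypothetical regression residuals.
residuals = [0.5, -0.3, 0.2, -0.4, 0.1, 0.3, -0.2]
print(durbin_watson(residuals))
```

Values well below 2 point to positive autocorrelation and values well above 2 to negative autocorrelation.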

Multi-Layer Perceptron Model
The operating neurons of the ANN (MLP) connect the three layers (input, hidden, and output) through adaptable synaptic weights, which indicate the strength of the relationship between two connected neurons. Every neuron in each layer sums all the inputs received from the previous layer and produces its output according to the selected transfer function. In the forward phase, the training data set is propagated through the hidden layer and emerges from the output layer. The error, i.e. the difference between the output values and the actual target values, is propagated back toward the hidden layer until the errors are reduced over successive cycles [76]. In the process, the neural network learns and changes its weights during the forward and backward phases. The normalized input variables PM10, RF, T, RH, WS, NO2, SO2, and CO of the respective air monitoring stations were fed into the six different ANN models using the normalizing data conversion facility of the ANN module of the SPSS software. In this study, we tried different combinations of transfer functions (sigmoid/hyperbolic tangent, sigmoid/linear, sigmoid/sigmoid, and hyperbolic tangent/linear) to compare and pick the optimum R² values. The network structure, the transfer functions of each model, and the performance indicators can be seen in Table 7 below, where the optimum R² values are marked in bold.
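Input normalization of this kind is often done by min-max rescaling to the interval [0, 1]. The sketch below assumes that form of normalization; the text does not state which rescaling option was selected in SPSS, so this is an illustration rather than the study's exact preprocessing, and the values are hypothetical.

```python
def min_max_normalize(values):
    """Rescale values linearly so that the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant series carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical relative humidity readings (%).
rh = [60.0, 75.0, 90.0]
print(min_max_normalize(rh))
```

Putting all inputs on a common scale keeps variables with large numeric ranges from dominating the weighted sums in the hidden layer.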

Predictive CART Model
Using CART analysis, several decision trees were developed based on different combinations of observed meteorological parameters, PM10, and gaseous pollutants for the three years (2016-2018). As is typical in machine learning, 60% of the total data points of the respective independent and dependent variables were used as the training set and 40% as the test set. The optimum model for each of the six air quality monitoring stations of Guwahati was the one with the least relative error, as given by equation 5 below.
$$ \text{Relative error of CART} = \frac{S(K)}{S(O)} \tag{5} $$
where S(K) is the sum of the squared residuals at the terminal nodes and S(O) is the sum of squared errors of the dependent variable around its mean in the root node. The predictive CART models and performance indicators are given in Table 8.
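Read as the ratio of the terminal-node sum of squared residuals S(K) to the root-node sum of squared errors S(O), the relative error of a tree can be computed as in the sketch below. It assumes that each observation has already been assigned to a terminal node; the leaf labels and values are hypothetical.

```python
def sum_sq(values):
    """Sum of squared deviations of the values around their mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def relative_error(y, leaf_ids):
    """S(K) / S(O): within-leaf squared error relative to root-node squared error."""
    leaves = {}
    for value, leaf in zip(y, leaf_ids):
        leaves.setdefault(leaf, []).append(value)
    s_k = sum(sum_sq(values) for values in leaves.values())
    s_o = sum_sq(y)
    return s_k / s_o


# Hypothetical next-day PM10 values and the terminal node each one falls into.
pm10 = [10.0, 12.0, 20.0, 22.0]
leaf = ["A", "A", "B", "B"]
print(relative_error(pm10, leaf))
```

A relative error of 1 means the tree explains nothing beyond the overall mean, while a value near 0 means the terminal nodes are nearly homogeneous.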

Model Comparison
All six performance indicators were used to compare the one-day-ahead PM10 prediction performance of the three methods, i.e. MLR, ANN (MLP), and CART, in order to isolate the best model, as shown in Table 8. NAE, MAE, MSE, and RMSE were used to measure the error of the model, where a value closer to 0 indicates a better model. The other two performance indicators, namely IA and R², were used to check the accuracy of the model results, where a value closer to 1 indicates higher accuracy. The values of the performance indicators provide specific information regarding predictive performance [33]. An RMSE-based comparison between models is preferred when the researchers want to avoid large prediction errors.
On the other hand, MAE captures the average magnitude of the errors without considering their direction. The advantage of the linear score of MAE lies in the fact that all individual differences between predictions and corresponding observed values are given equal weight in the average. However, amongst all six performance indicators, R² is the best single measure of how well the predicted values match the observed values.
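The six indicators can be computed as in the sketch below. The formulas are commonly used definitions and are assumptions here, since the text does not reproduce them: NAE as the total absolute error divided by the total of the observations, IA as Willmott's index of agreement, R² as the squared Pearson correlation between observed and predicted values, and MSE averaged over n. The data are hypothetical.

```python
import math


def indicators(obs, pred):
    """Return NAE, MAE, MSE, RMSE, IA and R2 for observed and predicted values."""
    n = len(obs)
    mean_o = sum(obs) / n
    mean_p = sum(pred) / n
    abs_err = [abs(p - o) for o, p in zip(obs, pred)]
    sq_err = [(p - o) ** 2 for o, p in zip(obs, pred)]
    mse = sum(sq_err) / n
    # Willmott's index of agreement (assumed definition).
    ia_den = sum((abs(p - mean_o) + abs(o - mean_o)) ** 2 for o, p in zip(obs, pred))
    # Squared Pearson correlation (assumed definition of R2).
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(obs, pred))
    var_o = sum((o - mean_o) ** 2 for o in obs)
    var_p = sum((p - mean_p) ** 2 for p in pred)
    return {
        "NAE": sum(abs_err) / sum(obs),
        "MAE": sum(abs_err) / n,
        "MSE": mse,
        "RMSE": math.sqrt(mse),
        "IA": 1.0 - sum(sq_err) / ia_den,
        "R2": cov ** 2 / (var_o * var_p),
    }


# Hypothetical observed and predicted next-day PM10 values.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [2.0, 3.0, 4.0, 5.0]
print(indicators(observed, predicted))
```

In this example the predictions are perfectly correlated with the observations but biased by a constant, so R² equals 1 while the error measures and IA still register the offset, which is why the indicators are read together.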
In this study, the prediction of one day ahead PM10 for all the six air quality monitoring stations displayed relatively good fits through the use of MLP methods (R 2 = 0. 63

Conclusion and Recommendation
A review of the literature on 28 Indian cities of different categories (tier-I, tier-II, and tier-III) reveals that the PM10 concentrations in all of them were on the higher side. Kolkata, a tier-I city in India, even recorded a PM10 concentration as high as 445 ± 210 μg m⁻³ during the wintertime. Tier-II cities like Raipur and Kanpur were found to be not lagging far behind the tier-I cities in terms of ambient PM10 concentration. Interestingly, tier-III cities like Jharia and Sonipat also recorded PM10 concentrations as high as 333.7 ± 17.86 μg m⁻³ and 213.67 ± 151.49 μg m⁻³, respectively. The PM10 concentration levels in all of the 28 Indian cities grossly violated both the WHO and NAAQS standards by a wide margin. Kolkata topped the list with 22.25 times the WHO standard and 7.42 times the NAAQS, followed by Bengaluru (17.49 times the WHO standard and 5.83 times the NAAQS) and Delhi (11.1 times the WHO standard and 3.7 times the NAAQS). Therefore, it is high time to initiate the requisite actions for diminishing or preventing the build-up of high ambient PM10 concentration levels in the cities. One way out is abatement through short-term traffic reduction in cities, based on PM10 concentration levels predicted in advance. This entails correctly predicting city-level PM10 concentrations at least one or two days in advance and initiating prior actions accordingly to save city dwellers from PM10-induced health hazards.
The tier-II city Guwahati recorded high variability in the observed PM10 level due to rapid urbanization. The highest daily average PM10 recorded during 2016-18 was 259.39 μg m⁻³, while the lowest was 40.67 μg m⁻³. During 2016-18, the average daily NO2, CO, and SO2 concentrations were found to correlate with PM10 concentrations, suggesting a common source for these compounds.
Different forecasting algorithms have been used in different cities of the world to predict PM10 in advance. However, MLR with stepwise inclusion of input variables was found to be the most widely used tool for temporal prediction of PM10 in the urban areas of India, and it has mostly been applied in the bigger cities of the country. This study found that, for forecasting the next day's PM10 concentrations in a tier-II city context, i.e. Guwahati, the nonlinear MLP algorithm with FFBN topologies of the ANN class gives the best predictions when compared with the linear MLR and nonlinear CART models. These models were critically assessed through a comparative evaluation of performance indicators, with the end goal of choosing the best-fitted model for accurately forecasting PM10 at the city level. The results of the study reveal that the one-day-ahead PM10 prediction ability for all six air quality monitoring stations of Guwahati was relatively better with the MLP method (R 2 = 0. 63 In the backdrop of CPCB's acknowledgement that comparatively smaller tier-II cities are also facing severe air pollution, city authorities are contemplating several steps for curtailing air pollution and the resulting health hazards. We recommend that local authorities use the nonlinear MLP (ANN) algorithm with FFBN topologies for forecasting PM10 concentrations in smaller Indian cities like Guwahati as well, in order to avoid PM-induced health hazards to a great extent. 'Predict pollution and defeat concentration' could be another approach to fighting the air pollution menace, in addition to the odd-even rule, which a few Indian cities are presently enforcing to rein in air pollution by curtailing vehicular emissions.
Moreover, with this model, the local SPCB authorities can caution city dwellers about impending dangerous levels of PM10, so that they can reduce their outdoor activities on those days and thereby avoid exposure to unhealthy air quality.