Robustness of Imputation Methods with Backpropagation Algorithm in Nonlinear Multiple Regression

Missing observations constitute one of the most important issues in data analysis in applied research studies. Their magnitude and structure affect parameter estimation in modeling, with important consequences for decision-making. This study aims to evaluate the efficiency of imputation methods combined with the backpropagation algorithm in a nonlinear regression context. The evaluation is conducted through a simulation study covering five sample sizes (50, 100, 200, 300 and 400), five missing data rates (10, 20, 30, 40 and 50%) and three missingness mechanisms (MCAR, MAR and MNAR). Four imputation methods (Last Observation Carried Forward, Random Forest, Amelia and MICE) were used to impute the datasets before making predictions with backpropagation. A 3-MLP model was used by varying the activation functions (Logistic-Linear, Logistic-Exponential, TanH-Linear and TanH-Exponential), the number of nodes in the hidden layer (3 to 15) and the learning rate (20 to 70%). Analysis of the performance criteria (R2, r and RMSE) of the network revealed good performance when it is trained with TanH-Linear functions, 11 nodes in the hidden layer and a learning rate of 50%. MICE and Random Forest were the most appropriate methods for data imputation. These methods can support up to a 50% missing rate with an optimal sample size of 200.


Introduction
Let Y be a real random variable whose mean depends on x = (x_1, ..., x_p) ∈ R^p, and let ε be an error term with E(ε) = 0. A parametric nonlinear regression model is represented by: Y = ζ(x_1, ..., x_p; θ) + ε, where ζ is nonlinear with respect to θ, the set of model parameters. This means that, for at least one θ_i, the derivative of ζ with respect to θ_i depends on at least one of the parameters. For example, ζ(x; θ) = θ_1 x_1 / (1 + θ_2 x_2) is used by chemists. Differentiating ζ with respect to θ_1 and θ_2 gives: ∂ζ/∂θ_1 = x_1 / (1 + θ_2 x_2) and ∂ζ/∂θ_2 = −θ_1 x_1 x_2 / (1 + θ_2 x_2)^2. One of the nonlinear models that has received great attention in recent years is the model based on artificial neural networks (ANNs). They are used in the fields of prediction and classification, fields in which regression models and other related statistical techniques have traditionally been used [1][2][3][4]. Multilayer perceptron neural networks (MLPs) are one of the architectures of ANNs acting as a type of regression model, not necessarily parametric, which enables complex functional forms to be modeled [5,6].
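As an illustration, the two partial derivatives above can be checked numerically against finite differences. The sketch below (in Python, with arbitrarily chosen values for x and θ) is only a verification aid, not part of the study:

```python
import numpy as np

# Example model from the text: zeta(x; theta) = theta1*x1 / (1 + theta2*x2)
def zeta(x1, x2, t1, t2):
    return t1 * x1 / (1 + t2 * x2)

def grad_analytic(x1, x2, t1, t2):
    # Partial derivatives with respect to theta1 and theta2
    d_t1 = x1 / (1 + t2 * x2)
    d_t2 = -t1 * x1 * x2 / (1 + t2 * x2) ** 2
    return d_t1, d_t2

x1, x2, t1, t2 = 1.5, 0.8, 2.0, 0.5   # arbitrary illustrative values
h = 1e-6
num_t1 = (zeta(x1, x2, t1 + h, t2) - zeta(x1, x2, t1 - h, t2)) / (2 * h)
num_t2 = (zeta(x1, x2, t1, t2 + h) - zeta(x1, x2, t1, t2 - h)) / (2 * h)
an_t1, an_t2 = grad_analytic(x1, x2, t1, t2)
assert abs(num_t1 - an_t1) < 1e-6
assert abs(num_t2 - an_t2) < 1e-6
```

Note that ∂ζ/∂θ_2 depends on θ_1 and θ_2, which is what makes the model nonlinear in its parameters.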
In breeding, knowledge of production is necessary for specialists who need simple and accurate techniques to predict the production of meat, eggs, milk, etc. Production is influenced by interdependent factors, and MLPs offer more flexibility in describing their relationships. But data collected on production are often small.
Let X = [x_ij] be a data matrix of dimension (n, p) with elements x_ij ∈ R, where n and p, elements of N*, are respectively the number of observations and the number of variables, and x_ij is the value of variable j ∈ [[1, p]] for observation i ∈ [[1, n]]. Let Z = [z_ij] be an indicator matrix of missing data with elements z_ij such that z_ij = 1 if x_ij is missing, and z_ij = 0 otherwise; then we have X = {X_obs, X_mis}. The matrix Z describes the structure of the missing data, and it is useful to treat it as a stochastic matrix. The statistical model for missing data is P(Z|X, κ), where κ is the parameter of the missing data process and P(·) denotes the conditional distribution of Z given X, with parameters κ. The mechanism of missingness is determined by the dependency of Z on the variables in the data set. According to [12], three categories of missing data can be distinguished: Missing Completely at Random (MCAR), Missing at Random (MAR) and Missing Not at Random (MNAR).
Definition 2.1. Missing data are said to be "missing completely at random" (MCAR) when the fact of not having a value is totally independent of the variables X, and we have:

∀ X, P(Z|X, κ) = P(Z|κ). (1)

When the missing data are not MCAR, we need to know whether differences in the characteristics of non-respondents and respondents can be explained by variables common to both. We denote by X_obs the observed part of the data X and by X_mis the missing part.
Definition 2.2. The data are said to be "missing at random" (MAR) when the distribution of Z given X depends only on the variables recorded in the database, X_obs, and we have:

∀ X_mis, P(Z|X_obs, X_mis, κ) = P(Z|X_obs, κ). (2)

Definition 2.3. The data are said to be "missing not at random" (MNAR) when the distribution of Z given X also depends on X_mis, i.e. when:

∃ X_mis such that P(Z|X_obs, X_mis, κ) ≠ P(Z|X_obs, κ). (3)
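The three mechanisms can be made concrete with a small simulation. In the sketch below (Python; the logistic response probabilities and the 30% MCAR rate are arbitrary choices for illustration), x_2 is the variable receiving missing values and Z is the indicator of missingness:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000
x1 = rng.normal(size=n)   # fully observed covariate
x2 = rng.normal(size=n)   # variable that will receive missing values

# MCAR: P(Z) is a constant, independent of x1 and x2
z_mcar = rng.random(n) < 0.3

# MAR: P(Z) depends only on the observed covariate x1
z_mar = rng.random(n) < 1 / (1 + np.exp(-x1))

# MNAR: P(Z) depends on the unobserved value x2 itself
z_mnar = rng.random(n) < 1 / (1 + np.exp(-x2))

# Apply the MCAR indicator: masked entries become NaN
x2_mcar = np.where(z_mcar, np.nan, x2)
assert np.isnan(x2_mcar).sum() > 0
```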
There are two basic methods for managing data matrices with missing values: (i) the deletion method and (ii) the imputation method [24]. The first considers only the individuals for which all the data are available, i.e. it deletes any individual having at least one missing value. The second consists of replacing the missing values in the data set with estimated values. Two imputation approaches are used: single imputation and multiple imputation [25]. Single imputation fills in each missing value with a single value. The second approach covers methods whose procedures are based on models: the missing values are replaced with several simulated values so as to properly reflect the uncertainty attached to the missing data [26].
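A minimal numeric sketch of the two basic strategies, deletion versus single (column-mean) imputation, on a toy 4×2 matrix:

```python
import numpy as np

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [4.0, np.nan],
              [5.0, 6.0]])

# (i) Deletion: keep only rows with no missing value
complete = X[~np.isnan(X).any(axis=1)]
assert complete.shape == (2, 2)

# (ii) Single imputation: fill each missing entry with its column mean
col_mean = np.nanmean(X, axis=0)
X_imp = np.where(np.isnan(X), col_mean, X)
assert not np.isnan(X_imp).any()
```

Multiple imputation would instead draw several plausible values per missing entry and combine the analyses, which single imputation cannot do.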
Factors affecting the predictive performance of a multilayer perceptron neural network and the back-propagation algorithm

A multilayer perceptron neural network (MLP) is a feedforward neural network consisting of a number of units (called neurons) connected by weighted links. The units are organized in several layers: the first is an input layer, the last is an output layer, and in between there can be one or several hidden layers. The input layer receives an external activation vector and transmits it via weighted connections to the units of the first hidden layer. These compute their activations and transmit them to the neurons in succeeding layers, see figure 1. Although multilayer perceptron neural networks have shown good predictive performance compared to classical methods, they are often affected by factors such as the number of neurons and layers, the choice of transfer functions and the sample size. For more detail, see [27]. The estimation of the network weights is done by minimizing a quadratic cost function. It can be done, among other methods, by the backpropagation algorithm (BP), whose procedure is summarized as follows:
1. Initialize all weights to small random values in the interval [-0.9, 0.9];
2. Normalize the training data;
3. Randomly permute the training data;
4. For each training datum k:
(a) Compute the observed outputs by forward propagating the inputs;
(b) Adjust the weights by backpropagating the observed error from the output layer towards the input layer:
w_ij(k + 1) = w_ij(k) + Δw_ij(k + 1) = w_ij(k) + η δ_j(k + 1) y_i(k + 1), (4)
with w_ij(k + 1) the adjusted weight for neuron j; w_ij(k) the previously computed weight for neuron j; 0 ≤ η ≤ 1 the learning rate; δ_j(k + 1) the local gradient computed for neuron j; and y_i(k + 1) either the output of neuron i on the previous layer, if it exists, or input i otherwise.
5. Repeat steps 3 and 4 up to a maximum number of iterations or until the root mean square error falls below a certain threshold.
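The procedure above can be sketched as follows. This is a minimal online backpropagation loop for a network with one TanH hidden layer and a linear output; the toy data, target function, m = 3 hidden neurons and η = 0.1 are illustrative choices, not those of the study:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy training data: p = 2 inputs, simple linear target
X = rng.normal(size=(20, 2))
y = 0.5 * X[:, 0] - 0.3 * X[:, 1]

m, eta = 3, 0.1
W1 = rng.uniform(-0.9, 0.9, size=(m, 2)); b1 = np.zeros(m)   # step 1: small random weights
W2 = rng.uniform(-0.9, 0.9, size=m);      b2 = 0.0

def rmse():
    pred = np.tanh(X @ W1.T + b1) @ W2 + b2
    return np.sqrt(np.mean((pred - y) ** 2))

rmse_before = rmse()
for epoch in range(200):                       # step 5: repeat up to a max number of iterations
    for k in rng.permutation(len(X)):          # step 3: random order over the training data
        a = np.tanh(W1 @ X[k] + b1)            # step 4a: forward pass, hidden layer
        out = W2 @ a + b2                      # forward pass, linear output
        err = y[k] - out
        delta_out = err                        # local gradient of the linear output neuron
        delta_h = (1 - a ** 2) * W2 * delta_out  # backpropagated gradients (tanh derivative)
        W2 = W2 + eta * delta_out * a; b2 = b2 + eta * delta_out      # step 4b: weight updates
        W1 = W1 + eta * np.outer(delta_h, X[k]); b1 = b1 + eta * delta_h

assert rmse() < rmse_before    # the training error decreased
```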

Specification of model and generation of a data population
The nonlinear regression model considered is a multilayer perceptron neural network with one hidden layer, and its expression is:

Y = ζ_θ(x) + ε, with E(ε) = 0 and ζ_θ(x) = f_2( b^(2) + Σ_{i=1}^{m} w^(2)_{1i} f_1( b^(1)_i + Σ_{j=1}^{p} w^(1)_{ij} x_j ) ), (5)

where x ∈ R^p is a vector of p inputs, and Y ∈ R and ζ_θ(x) ∈ R are respectively the observed output and the predicted output. The quantities w^(1)_{11}, ..., w^(1)_{mp}; b^(1)_1, ..., b^(1)_m; w^(2)_{11}, ..., w^(2)_{1m}; b^(2) are the model's parameters, and the total number of parameters is n_θ = m(p + 1) + (m + 1), with m the number of neurons in the hidden layer. Here f_2 is a transfer function applied to the only neuron of the output layer, and f_1 is a vector composed of the same transfer function applied to each neuron in the hidden layer.
In order to have data with multicollinearity and a nonlinear relationship between variables, and for predictive purposes, we used the results of the Insect as Feed for West Africa project [28], which evaluated the effect of maggot meal on the growth and economic performance of guinea fowl. The dependent variable is y = food economic efficiency and the independent variables are x = (x_1 = dose, with three modalities (0, 50 and 100); x_2 = age; x_3 = food consumption; x_4 = weight). The predictive model obtained is:

y_t = f_2( b + Σ_{i=1}^{11} w^(2)_{1i} f_1( b_i + w_i^T x_t ) ) + ε_t,

where y_t represents the t-th observation (t ∈ [[1, n]], n ∈ N*); w_i is the weight vector associated with the i-th neuron in the hidden layer (i ∈ [[1, 11]]); and b_i and b are respectively the bias of the i-th neuron in the hidden layer and the bias applied to the output neuron of the 3-MLP model. The optimal parameters are θ = (w^(1), b^(1), w^(2), b^(2)), and the total number of parameters is n_θ = 67.
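The parameter count above can be verified directly: with p = 4 inputs and m = 11 hidden neurons, the hidden layer has m·p weights and m biases, and the output layer m weights and one bias.

```python
# Hidden layer: m*p weights + m biases; output layer: m weights + 1 bias
def n_params(m, p):
    return m * (p + 1) + (m + 1)

assert n_params(11, 4) == 67   # n_theta = 67 reported in the text
```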

Simulations study
Seven factors were considered in this study: the sample size (5 different sizes), the missingness mechanism (3 different mechanisms), the missing data rate (5 different rates), the imputation method (4 different methods), and the factors affecting the predictive and explanatory performance of the MLP model, namely the activation function (4 different functions), the number of hidden neurons (13 different sizes) and the learning rate (6 different rates). For each sample size, we have a combination of 936,000 items, which is replicated 100 times.

Sampling size, simulating missingness and missing data imputation
Five samples of different sizes n_i (n_i = 50, 100, 200, 300 and 400) were extracted from the population using the bootstrap technique [29]. Three missingness mechanisms were considered, MAR, MCAR and MNAR (see Section 2), with five missing data rates (MR) (10, 20, 30, 40 and 50%) to generate incomplete datasets. Missingness simulation was conducted on each of the five complete datasets using the MICE package [30] of the R software 3.3.6 [31]. Each of the incomplete datasets thus obtained was imputed with the Last Observation Carried Forward (LOCF), Random Forest (RF), Amelia (AMELIA) and Multivariate Imputation by Chained Equations (MICE) methods in R, using respectively the zoo [32], missForest [33], Amelia [17] and MICE [30] packages.
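The bootstrap extraction step can be sketched as follows (a Python stand-in for the R workflow; the synthetic population is purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
population = rng.normal(size=5000)   # stands in for the generated data population

# Bootstrap: sample each of the five sizes with replacement from the population
sizes = [50, 100, 200, 300, 400]
samples = {n: rng.choice(population, size=n, replace=True) for n in sizes}

assert [len(samples[n]) for n in sizes] == sizes
```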

Prediction with 3-MLP in R software
Before performing the prediction, 75% of each imputed dataset is used to train the neural network and 25% to test the trained network's generalization capacity. Before training and testing, the imputed datasets were normalized using the min-max normalization technique [34]:

new_v = (v − min(z)) / (max(z) − min(z)),

where v is an observation of the vector z and new_v is the normalized observation. The function "mlp" of the RSNNS package [35] was used for the prediction. A 3-MLP model (see equation (5)) was used by varying the hyper-parameters for each sample size of imputed dataset. Four combinations of activation functions, AF (f_1 and f_2, see equation (5)), were used: (i) Logistic-Linear (LL); (ii) Logistic-Exponential (LE); (iii) TanH-Linear (TL) and (iv) TanH-Exponential (TE). The activation functions considered are:

Logistic: f(x) = 1 / (1 + e^(−x)); TanH: f(x) = (e^x − e^(−x)) / (e^x + e^(−x)); Linear: f(x) = x; Exponential: f(x) = e^x.

In addition, 13 numbers of nodes (Node) in the hidden layer were considered: 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14 and 15; 6 learning rates (LR): 20%, 30%, 40%, 50%, 60% and 70%; as well as the 5 sample sizes of imputed dataset (Size). The learning algorithm considered is standard backpropagation (see Section 2).
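A short sketch of the min-max normalization and the four transfer functions listed above (Python; the toy matrix z is illustrative):

```python
import numpy as np

def min_max(z):
    # new_v = (v - min z) / (max z - min z), mapping each column to [0, 1]
    return (z - z.min(axis=0)) / (z.max(axis=0) - z.min(axis=0))

logistic    = lambda x: 1 / (1 + np.exp(-x))
tanh        = np.tanh
linear      = lambda x: x
exponential = lambda x: np.exp(x)

z = np.array([[2.0, 10.0],
              [4.0, 30.0],
              [6.0, 20.0]])
zn = min_max(z)
assert zn.min() == 0.0 and zn.max() == 1.0
assert abs(logistic(0.0) - 0.5) < 1e-12
```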
A total of 100 replications was performed on each size of imputed dataset to analyze the performance of the methods. Initial weights were generated randomly according to a uniform distribution on the interval [−3, 3]. The stopping criterion used is the combination of a fixed number of epochs, NE = 1000, and a sufficiently small training error, less than or equal to 10^(−6).

Performance criteria and statistical method comparison
The performance criteria used are: (i) the coefficient of correlation, r; (ii) the coefficient of determination, R^2; and (iii) the Root Mean Squared Error, RMSE [36,37]. In the formulas below, y and ζ_θ respectively denote the observed and predicted outputs, ȳ and ζ̄_θ their means, and n the test data size:

r = Σ_t (y_t − ȳ)(ζ_θ,t − ζ̄_θ) / sqrt( Σ_t (y_t − ȳ)^2 · Σ_t (ζ_θ,t − ζ̄_θ)^2 );
R^2 = 1 − Σ_t (y_t − ζ_θ,t)^2 / Σ_t (y_t − ȳ)^2;
RMSE = sqrt( (1/n) Σ_t (y_t − ζ_θ,t)^2 ).
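The three criteria can be computed as follows (a straightforward Python sketch of the standard formulas):

```python
import numpy as np

def criteria(y, y_hat):
    # Coefficient of correlation between observed and predicted outputs
    r = np.corrcoef(y, y_hat)[0, 1]
    # Coefficient of determination
    r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
    # Root mean squared error
    rmse = np.sqrt(np.mean((y - y_hat) ** 2))
    return r, r2, rmse

y = np.array([1.0, 2.0, 3.0, 4.0])
r, r2, rmse = criteria(y, y)   # perfect prediction
assert abs(r - 1.0) < 1e-12 and abs(r2 - 1.0) < 1e-12 and rmse == 0.0
```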
For a given missing data mechanism, the appropriate imputation method, together with the best configuration of model (5) trained with the BP algorithm and an optimal sample size, is the one for which we observe a high correlation between predicted and observed data (|r| ≥ 0.8) [36], an R^2 close to 1 [38] and a low value of RMSE [36].
To assess the effects of the factors (Size, MR, AF, Node and LR) that affect the performance of the 3-MLP model, generalized linear models based on the beta distribution were fitted to R^2 and |r|, and linear fixed-effects models to RMSE, for each missing data mechanism and by imputation method.
Interaction plots were drawn for the significant interactions between the MLP hyper-parameters, by missing data mechanism.
The mean, minimum, maximum and coefficient of variation of the criteria considered (R^2, |r| and RMSE) were used to compare the performances of the imputation methods.

Results
Effect of imputation methods by missing data mechanism on the performance of the hyper-parameter structure of the 3-MLP model

Table 1 shows the results of the effect of the imputation methods (Amelia, LOCF, RF and MICE) by missing data mechanism (MAR, MCAR and MNAR) on the performance of the hyper-parameter structure (AF, LR and Node) of the 3-MLP model. The analysis shows that AF, LR and Node significantly affect the performances of the imputation methods whatever the missing data mechanism (p < 0.05). However, the second-order interaction of these factors (AF:LR:Node) did not impact (p > 0.05) the performances of the imputation methods for any of the three missing data mechanisms. The predictive performances (R^2 and r) of the imputation methods were not affected by the interaction between the learning rate and the number of neurons in the hidden layer (LR:Node), but this interaction had a significant impact on the root mean square error (RMSE) for each missing data mechanism. Considering the interaction between the activation function and the number of neurons in the hidden layer (AF:Node), we observed that, from one missing data mechanism to another, the predictive performances of AMELIA and LOCF were not affected by this interaction, whereas those of RF and MICE were significantly affected. Concerning the RMSE, apart from that of LOCF under the MAR assumption, all were significantly affected by this interaction. The results also revealed a significant effect of the interaction between the activation function and the learning rate (AF:LR) on the performances of the imputation methods for all missing data mechanisms.

Table 1. Effect of imputation methods by missing data mechanism on the structure of hyper-parameters: Results of GLM and linear models

The interaction plots revealed that under the MAR assumption, the performances of the imputation methods increase with the learning rate when activation functions such as TanH-Linear (TL), Logistic-Linear (LL) and Logistic-Exponential (LE) are used, see figure 2. Contrary to these activation functions, TanH-Exponential (TE) starts to decrease after a 30% learning rate. The best values of R^2 and r were obtained with the TanH-Linear activation function, followed by Logistic-Exponential and Logistic-Linear. Concerning the RMSE, Logistic-Exponential yielded the highest values, indicating that the network commits more error with this activation function. The TanH-Exponential activation function gave the best RMSE; with this function the error varies only slightly when the learning rate increases, contrary to the other functions, whose error increases with the learning rate. For TanH-Linear and Logistic-Linear the RMSE values were close, but after 40% Logistic-Linear yielded a greater RMSE than TanH-Linear.
A similar trend was observed when the missingness mechanism is either MCAR or MNAR. Thus, the highest values of R^2 and r were observed with TanH-Linear, while the lowest values were observed with TanH-Exponential. As under the MAR assumption, the TanH-Exponential function commits little error when the data are MCAR or MNAR. For the three missingness mechanisms, the predictive performances of the imputation methods vary only slightly beyond a 50% learning rate, indicating that the neural network can be trained with a 50% learning rate for each activation function. Moreover, little variation of the error was observed from a 50% learning rate for TanH-Exponential and TanH-Linear, contrary to Logistic-Exponential and Logistic-Linear, whose error continues to increase.

Effect of imputation methods by missing data mechanism on the performance of activation function and Node
The performances of the imputation methods according to the activation function and the number of nodes in the hidden layer were almost the same for the three missingness mechanisms, see figure 3. The predictive performances of the imputation methods improve as the number of nodes increases, for all missing data mechanisms. When data are missing at random (MAR), the R^2 and r values for the TanH-Linear activation function were greater than the values recorded with the other activation functions, and Logistic-Linear yielded the lowest values of R^2 and r. The same trend was observed under the MNAR assumption. Under the MCAR assumption, however, it is the TanH-Exponential activation function that had the lowest values of R^2 and r for the LOCF method. Concerning the errors committed by the model, they became lower and lower as the number of hidden neurons increased, and this for all imputation methods and all three missing data mechanisms. The model commits more error with the Logistic-Exponential activation function, whereas the TanH-Exponential function commits less. The errors with the Logistic-Linear and TanH-Linear activation functions were lower than with Logistic-Exponential. For these two activation functions, the RMSE was similar from 3 to 7 neurons; after 7 nodes, the model with TanH-Linear was better than the one with Logistic-Linear. The trend of the RMSE observed under the MAR assumption was similar to the one observed under the MCAR and MNAR assumptions.
We also noticed that, for the three missing data mechanisms, little variation of the performances (R^2, r and RMSE) was observed beyond 11 nodes in the hidden layer, whatever the imputation method used.
Effect of imputation methods on size and the missing data rate

Table 2 shows how sample size, missing rate and their interaction affect the performances of the imputation methods.

When the data are missing at random (MAR), an improvement of R^2 and r was noticed for the LOCF method from a sample size of 50 to 200, whatever the missing data rate considered. But after 200, the predictive performances start to decrease. The RMSE for this method under the same missingness assumption followed the same trend: it was close across missing data rates and varied only slightly.

Under the missing completely at random (MCAR) assumption, the error was close for all missing data rates for LOCF, Random Forest and MICE, and varied slightly below a sample size of 200. The predictive performance of LOCF was best with 10% and 40% missing rates at sample sizes of 200 and 300 respectively. However, the RMSE was greater with a 40% missing rate. For Random Forest and MICE, R^2 and r values were better with 10% and 20% missing rates respectively at a sample size of 200.

When data are missing not at random (MNAR), the performances obtained for LOCF were similar to those obtained under the MAR assumption: R^2 and r increased between sample sizes of 50 and 200 whatever the missing data rate, but decreased after 200. The error was close across missing rates and was best below a sample size of 200. The performance of the Random Forest method was best with a 40% missing rate at a sample size of 200; however, with 20, 30 and 50% missing rates, the values of R^2 and r were close to those obtained with 40%. The error did not vary among missing data rates. With the MICE method, the error did not vary among missing data rates either, as observed with Random Forest. Values of R^2 and r were better with 10, 30 and 50% missing data rates than with 20 and 40%; however, the differences were not important. Concerning the Amelia method, large variation was observed among missing data rates for all sample sizes. The error varied slightly below a sample size of 100 and increased beyond it. R^2 and r were better for 10 and 20% missing data rates at a sample size of 100.

Comparison of imputation methods
The mean and coefficient of variation of the performance criteria according to imputation method and missing data mechanism are presented in Table 3. There is not great variation among imputation methods under the assumption that data are missing at random. However, the LOCF method has the lowest value of R^2 (62.59%). For the other imputation methods used in this study, the values obtained were close (65.22%, 66.46% and 66.19% respectively for AMELIA, RF and MICE). The error committed by the model (RMSE) was similar for all imputation methods. The coefficient of correlation was also low with LOCF (78.76%) compared to the other methods (80.45%, 81.44% and 81.07% respectively for AMELIA, RF and MICE).
When data are missing completely at random (MCAR), a trend similar to that under the MAR assumption is observed. The lowest R^2 and r were obtained with the LOCF method (59.28% for R^2), whereas those of AMELIA, RF and MICE were respectively 66.68%, 65.77% and 66.34%. The coefficient of correlation was also low (76.64%) when using LOCF for imputation, compared to AMELIA (81.43%), RF (80.78%) and MICE (81.19%). The RMSE was similar between methods (0.035 for RF and MICE; 0.037 and 0.034 respectively for AMELIA and LOCF).
Under the assumption that data are missing not at random (MNAR), the trends for R^2 and r differed from the values observed when data are MAR and MCAR. R^2 was lower with AMELIA (64.49%) and LOCF (63%) than with RF (67.35%) and MICE (66.82%). As observed with the other missing data mechanisms, the RMSE was similar under the MNAR assumption: the error was 0.036 for AMELIA, 0.035 for LOCF and RF and 0.034 for MICE. The coefficient of correlation did not vary greatly from one method to another: we recorded 79.96%, 79.03%, 81.84% and 81.55% for AMELIA, LOCF, RF and MICE respectively.

Discussion
Effect of imputation method by missing data mechanism on the structure of hyper-parameters of the 3-MLP model

For each imputation method, the interactions AF:LR and AF:Node significantly impacted the performance of the network for every missing data mechanism. The performance of the 3-MLP model is best when the network is trained with the TanH-Linear activation function, 11 nodes in the hidden layer and a learning rate of 50%. The accuracy of the TanH-Linear activation function is much better than that of the other functions. Concerning the number of nodes, even if the error continues to decrease after 11 nodes, the model's gain in prediction does not increase considerably. Moreover, [39], [37] and [38] suggest keeping the number of nodes in the hidden layer as small as possible, because a network with a large number of nodes increases the computational time needed for training. Our findings about the number of nodes in the hidden layer are in agreement with those of [40], who state that the best approach to setting the number of nodes in the hidden layer is to start with a small number and increase it until no major improvement in performance is obtained. As regards the learning rate, beyond 50% the predictive ability of the network still increases, but this increase is not important, and a larger learning rate causes the network to be more unstable as the error increases. Our results for the optimum learning rate are in agreement with those of [41], who state that if the learning rate is large, the network may show an oscillatory response because of larger changes in the synaptic weights, which may cause the network to be unstable. However, our optimum learning rate (50%) is less than the 60% suggested by [42]. Another study, conducted by [23], set the optimum learning rate at 35%, which is less than that of [42] and that of our study. This difference may be due to the different domains of application.

Effect of imputation methods on size and missing data rate
Both sample size and missing data rate affect the imputation methods. No matter the mechanism and the method used, the error increased when the sample size increased, for all missing rates. Apart from Amelia, the optimal size is 200 for the other methods under the three missingness mechanisms. For Amelia, the optimal sample size is 200 under the MAR and MCAR assumptions, but 100 when the missingness mechanism is MNAR. LOCF can support a missing data rate of up to 50% with an optimal sample size of 200 under the MAR and MNAR assumptions; this method performs best with a 10% rate under the MCAR assumption. Since the differences between missing data rates are not important, Random Forest and MICE can support up to a 50% missing rate at an optimal sample size of 200. Our results are not in agreement with those of [43], who found that the error decreased when the sample size increased, no matter the missing rate. The difference might be explained by the fact that in our study the imputed data pass through the network before the performances are evaluated.

Comparison of imputation methods
Four imputation methods were used in this study, and the results show that there is not great variation among them under the assumptions that data are missing at random (MAR) and missing not at random (MNAR). However, the Random Forest and MICE methods seem to perform better than AMELIA and LOCF, since they show less error. Our findings are in agreement with the results of [44] and [45]. These authors compared nine imputation methods considering the three missingness mechanisms (MAR, MCAR and MNAR) and found that MICE multiple imputation was overall the best approach. Another study [33] compared the random forest method to kNN imputation [46], MissPALasso (a method based on the EM algorithm, proposed by [47]) and MICE [30]. For these authors, random forest could outperform the other imputation methods. Their results are similar to our findings: indeed, in this study random forest and MICE yielded similar performances.
Under the MCAR assumption, LOCF is not indicated for handling missing data, since it gives a low R^2. This agrees with the conclusion of [48], who states that single imputation and LOCF are not optimal approaches for missing value imputation, as they can cause bias and lead to invalid conclusions. Moreover, [49] states that single imputation methods are not solidly grounded in mathematical foundations and exist merely for their ease of implementation. Most imputation methods assume that data are missing at random. Our results show that even if this assumption is violated they perform well, since the performances recorded for AMELIA, RF and MICE do not vary greatly from one missing data mechanism to another. These findings are in agreement with [50], who state that MICE is especially suitable in MAR settings; but [51] and [52] point out that MICE is also capable of dealing with MNAR schemes.

Conclusions
The possibility of combining imputation methods with a multilayer neural network was assessed in this study through four methods (Amelia, LOCF, Random Forest and MICE) for each missing data mechanism, by controlling the hyper-parameters (activation function, number of hidden neurons, learning rate). From our findings, single imputation is not an optimal approach to deal with missing data, whereas MICE multiple imputation and RF are more appropriate. Even if these methods outperform the two others (Amelia and LOCF), the best solution is to employ maximal effort to avoid missing data during data collection. With regard to the hyper-parameters, when learning the model with the backpropagation algorithm, the performance criteria showed that the TanH-Linear combination of activation functions is best suited to implement the network, with 11 nodes in the hidden layer and a learning rate of 50%. For further studies, other recently developed methods should be compared with the best methods found in this study, using other learning methods.