Estimation of Missing Streamflow Data Using ANFIS Models and Determination of the Number of Datasets for ANFIS: The Case of Yeşilırmak River

Abstract: Good data analysis is required for the optimal design of water resources projects. However, data are not regularly collected due to material or technical reasons, which results in incomplete-data problems. Available data and data length are of great importance to solve those problems. Various studies have been conducted on missing data treatment. This study used data from the flow observation stations on Yeşilırmak River in Turkey. In the first part of the study, models were generated and compared in order to complete missing data using ANFIS, multiple regression and the Normal Ratio Method. In the second part of the study, the minimum number of data required for ANFIS models was determined using the optimum ANFIS model. Of all methods compared in this study, ANFIS models yielded the most accurate results. A 10-year training set was also found to be sufficient as a data set.


Introduction
Both the growing population and the rapidly developing industrialization lead to an increased demand for water. The limited availability of resources results in a number of problems in meeting the demand. Exploitation of unused water resources or using existing water resources in an optimum way can be a solution to these problems. Optimal utilization of available water resources, in particular, requires a good analysis of data. Due to the small number of stations in project areas or insufficient data length, some studies have been undertaken to generate new data using existing measurement stations [1]. These hydrological studies have mainly focused on precipitation [2], evaporation [3] and river flows [4].
Studies on missing data treatment generally address data correlation [5,6], artificial-intelligence-based back-propagation (BP) neural networks [7], ANFIS models [8,9] and models using artificial neural networks (ANN) [10,11]. In addition, fuzzy studies [12], in which modeling is based on pure expert knowledge, are also important. Studies on missing data treatment using ANFIS include the completion of missing flow data of the Middle Euphrates basin [13], the completion of missing precipitation data in Serbia [14] and Malaysia [15], and the completion of missing flow data and modelling of sediment transport of Terengganu River, Malaysia [16] and Gediz River, Turkey [17].
This study investigated the monthly data of the stations on Yeşilırmak River in the north of Turkey. In the first part of the study, multiple regression tests based on interstation correlations were performed. In the second part of the study, an optimum data completion model was selected using ANFIS. In the last part of the study, the number of data required for a correct prediction was investigated and the minimum number of data required for reliable estimates was discussed.

Materials and Methods
The first part of this section presents information and statistics on Yeşilırmak River and its stations. The second part provides information on the classical method, the multiple regression method and ANFIS used in the study.

Yeşilırmak River and Stations
The Yeşilırmak basin, one of the 25 basins in Turkey, is located between latitudes 39° 30' and 41° 21' and longitudes 34° 40' and 39° 48' (Figure 1). The basin is named after Yeşilırmak River. The main river channel of the basin is 519 km in length. The main tributaries of Yeşilırmak River are the Kelkit, Çekerek, Çorum, Çat and Tersakan streams. Estimated to cover about 3.8 million ha, the Yeşilırmak basin is the third largest basin in Turkey [18,19]. Table 1 shows the statistics of the stations. Table 2 summarizes the correlation between the stations.

Missing Data Treatment Using Normal Ratio Method
In this method, each input value is divided by the annual average of its own station, and the result is multiplied by the average of the station whose missing data are to be completed. The values obtained for all input stations are then summed and divided by the number of inputs to give the missing value [16]:

$$Q_x = \frac{1}{n} \sum_{i=1}^{n} \frac{\bar{Q}_x}{\bar{Q}_i} Q_i$$

where $Q$ is the flow rate, $\bar{Q}$ is the average flow of a station, the subscript $x$ denotes the station being completed, the subscript $i$ denotes an input station, and $n$ is the number of input stations.
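The calculation described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `normal_ratio_fill` and its argument names are hypothetical, and NumPy is assumed to be available; none of these come from the study itself.

```python
import numpy as np

def normal_ratio_fill(neighbor_flows, neighbor_means, target_mean):
    """Estimate one missing flow value with the Normal Ratio Method.

    neighbor_flows : flows observed at the n input stations in the missing period
    neighbor_means : average flow of each input station (same order)
    target_mean    : average flow of the station being completed
    (All names are illustrative assumptions, not taken from the paper.)
    """
    flows = np.asarray(neighbor_flows, dtype=float)
    means = np.asarray(neighbor_means, dtype=float)
    if flows.shape != means.shape or flows.size == 0:
        raise ValueError("flows and means must be non-empty and the same length")
    # Divide each input by its own station average, rescale by the target
    # station average, then average over the n input stations.
    return float(np.mean(target_mean / means * flows))

# Made-up numbers, purely for illustration (not station data from the study).
estimate = normal_ratio_fill([10.0, 30.0], [20.0, 60.0], 40.0)
```

With two input stations that both run at half of their own average, the estimate is half of the target station's average, which is the behaviour the method is designed to give.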

Multiple Regression Analysis
Multiple regression analysis is a statistical method for quantifying the relationship between variables that affect each other. The value to be estimated is written as a function of the values affecting it [21]:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n + \varepsilon$$

where $y$ is the dependent (estimated) variable, $x_i$ is the independent (explanatory) variable, $\beta_i$ is the regression coefficient, $n$ is the number of input parameters and $\varepsilon$ is the error term.
Multiple linear regression analysis can be used when data are normally distributed, the relationship between independent variables and dependent variable is linear, and error variance for each independent variable is constant [22].
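As a rough illustration of how such a model can be fitted, the sketch below estimates the regression coefficients by ordinary least squares with NumPy. The function names `fit_multiple_regression` and `predict_regression` are hypothetical, and ordinary least squares is an assumption here; the study does not state which fitting procedure or software was used.

```python
import numpy as np

def fit_multiple_regression(X, y):
    """Return [b0, b1, ..., bn] for y ~ b0 + b1*x1 + ... + bn*xn.

    X : array of shape (n_samples, n_inputs), e.g. flows at the input stations
    y : array of shape (n_samples,), e.g. flows at the station being completed
    Ordinary least squares is assumed for this sketch.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Prepend a column of ones so that the intercept b0 is fitted as well.
    design = np.column_stack([np.ones(len(X)), X])
    coeffs, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coeffs

def predict_regression(coeffs, X):
    """Evaluate the fitted equation for new input rows."""
    X = np.asarray(X, dtype=float)
    return coeffs[0] + X @ coeffs[1:]

# Synthetic, noise-free example (not data from the study).
X_demo = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y_demo = 1.0 + 2.0 * X_demo[:, 0] + 3.0 * X_demo[:, 1]
coeffs_demo = fit_multiple_regression(X_demo, y_demo)
```

Because the synthetic target is an exact linear function of the inputs, the fitted coefficients recover the intercept and the two slopes used to generate it.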

ANFIS (Adaptive Neuro-Fuzzy Inference System)
Developed by Jang [22], ANFIS is a modeling method that combines fuzzy logic and ANN models. Unlike pure fuzzy logic, ANFIS acquires its rules automatically from data. The ANFIS structure combines the learning ability of artificial neural networks with fuzzy logic inference, and it is therefore more successful than either an artificial neural network model or fuzzy logic used alone. When input and output values are known, ANFIS determines all possible rules or allows them to be generated from the input and output values (Figure 2). The ANFIS structure consists of five layers: fuzzification layer, rule layer, normalization layer, defuzzification layer and summation layer (Figure 3). The first and fourth layers are adaptive [23,24].
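The five layers listed above can be illustrated with a minimal forward pass. The sketch below makes several assumptions that the study does not state: Gaussian membership functions, a grid rule base with one rule per combination of fuzzy sets, and first-order (linear) rule outputs. The function names and parameter layout are hypothetical, and no training step is shown.

```python
import numpy as np
from itertools import product

def gaussian_mf(x, center, sigma):
    """Gaussian membership grade (an assumed membership function shape)."""
    return np.exp(-0.5 * ((x - center) / sigma) ** 2)

def anfis_forward(x, centers, sigmas, consequents):
    """One forward pass through the five layers for a single input vector.

    x           : shape (n_inputs,)
    centers     : shape (n_inputs, n_sets), layer-1 (adaptive) parameters
    sigmas      : shape (n_inputs, n_sets), layer-1 (adaptive) parameters
    consequents : shape (n_rules, n_inputs + 1), layer-4 (adaptive) parameters,
                  with n_rules = n_sets ** n_inputs under the assumed grid rule base
    """
    x = np.asarray(x, dtype=float)
    centers = np.asarray(centers, dtype=float)
    sigmas = np.asarray(sigmas, dtype=float)
    consequents = np.asarray(consequents, dtype=float)
    n_inputs, n_sets = centers.shape

    # Layer 1 (fuzzification): membership grade of each input in each fuzzy set.
    mu = gaussian_mf(x[:, None], centers, sigmas)

    # Layer 2 (rules): firing strength is the product of the chosen grades.
    rules = list(product(range(n_sets), repeat=n_inputs))
    w = np.array([np.prod([mu[i, s] for i, s in enumerate(rule)]) for rule in rules])

    # Layer 3 (normalization): firing strengths are scaled to sum to one.
    # (If x lies far from every center, w.sum() can underflow to zero.)
    w_norm = w / w.sum()

    # Layer 4 (defuzzification): each rule output is linear in the inputs.
    f = consequents[:, :-1] @ x + consequents[:, -1]

    # Layer 5 (summation): weighted sum of the rule outputs.
    return float(np.sum(w_norm * f))

# Two inputs with 5 fuzzy sets each give 25 rules under the grid assumption.
centers_demo = np.tile(np.linspace(0.0, 1.0, 5), (2, 1))
sigmas_demo = np.full((2, 5), 0.2)
consequents_demo = np.tile([1.0, 2.0, 3.0], (25, 1))
output_demo = anfis_forward([0.3, 0.6], centers_demo, sigmas_demo, consequents_demo)
```

In this demonstration every rule shares the same linear output, so the normalized weights cancel and the network simply returns that linear function of the inputs. In a trained model the layer-1 and layer-4 parameters would differ from rule to rule.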

Results
Models were developed using Yeşilırmak River data for the estimation of missing data of stations 1402 and 1413. Two-input-one-output and four-input-one-output models were developed to complete the missing data of station 1402. In the two-input-one-output models, station 1402 was the output, and stations 1413 and 1401, connected to station 1402 on the left- and right-hand sides, respectively, were the inputs. In the four-input-one-output models, stations 1412 and 1414, connected to station 1413 on the left- and right-hand sides, respectively, were used in addition to these stations to estimate station 1402. A further two-input-one-output model was developed in which stations 1414 and 1412 were used for the estimation of missing data of station 1413.
For stations 1402 and 1413, classical and multiple regression models were constructed alongside the ANFIS models and compared with them. In the last part of the study, the minimum number of data required to reach a correct result using ANFIS models was determined.

First Data Set Models
The aim of these models was to complete the missing data of station 1402. For this, the data of stations 1413 and 1401 were used. Of 540 data, the first 400 were used for training and the remaining for testing. In addition to ANFIS models generated by changing the number of sets of input parameters, the classical method and the multiple regression model were used to compare the results (Table 3).
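The train/test division described above is chronological: the earliest records are used for training and the rest for testing. A minimal sketch of such a split, together with two common goodness-of-fit measures, is given below. The function names are hypothetical, and the choice of the correlation coefficient and the root mean square error is an assumption, since the exact criteria reported in the tables are not reproduced in this text.

```python
import numpy as np

def chronological_split(inputs, target, n_train):
    """Split time-ordered records into a leading training part and a trailing test part."""
    inputs = np.asarray(inputs, dtype=float)
    target = np.asarray(target, dtype=float)
    train = (inputs[:n_train], target[:n_train])
    test = (inputs[n_train:], target[n_train:])
    return train, test

def correlation(observed, estimated):
    """Pearson correlation between observed and estimated values (assumed metric)."""
    return float(np.corrcoef(observed, estimated)[0, 1])

def rmse(observed, estimated):
    """Root mean square error between observed and estimated values (assumed metric)."""
    observed = np.asarray(observed, dtype=float)
    estimated = np.asarray(estimated, dtype=float)
    return float(np.sqrt(np.mean((observed - estimated) ** 2)))

# Placeholder arrays with the record count quoted above: 540 records, 400 for training.
inputs_demo = np.arange(540 * 2, dtype=float).reshape(540, 2)
target_demo = np.arange(540, dtype=float)
(train_X, train_y), (test_X, test_y) = chronological_split(inputs_demo, target_demo, 400)
```

Keeping the split chronological rather than random means the test records all fall after the training period.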
The equation of the classical method is:
The equation of the multiple regression model is:
The results of this part of the study show that all ANFIS models yield better results than the classical and multiple regression models.
The models in which each input has 5 subsets are the optimum models. The speed of the training phase of the model is also noteworthy. The comparison of the values obtained from the optimum ANFIS model with the observed values shows that the errors at both the minimum and maximum flow values are very small (Figure 4).

Second Data Set Models
These models also aimed to complete the missing data of station 1402. To achieve this, the data of stations 1412 and 1414 as well as those of stations 1413 and 1401 were used. Of 504 data, the first 405 were used for training and the remaining for testing. In addition to ANFIS models generated by changing the number of sets of input parameters, the classical method and the multiple regression model were used to compare the results (Table 4). The results show that ANFIS models provide more accurate results than the classical model but worse results than the multiple regression models. The models with 4 inputs are quite slow, especially when the number of subsets per input is greater than 5. The models with an increasing number of inputs are much slower than multiple regression models. The comparison of the values of the optimum ANFIS model with the observed values shows that the largest errors occur at the minimum flow values, whereas the errors at the maximum flow values are quite small (Figure 5).

Third Data Set Models
The aim of these models was to complete the missing data of station 1413. For this, the data of stations 1412 and 1414 were used. Of 504 data, the first 405 were used for training and the remaining for testing. The results were compared using the classical method and the multiple regression model as well as ANFIS models generated by changing the number of sets of input parameters (Table 5).
The equation of the classical method is:
The equation of the multiple regression model is:
The results show that ANFIS models provide more accurate results than the classical and multiple regression models. Although the results are not as good as those for the first data set, they remain within acceptable error limits. The models in which each input has 5 subsets are the optimum models. The comparison of the values of the optimum ANFIS model with the observed values shows that the errors at both the minimum and maximum flow values are very small (Figure 6).

... the model of choice for this purpose. The models were trained using a 10-year data set, and the procedure was repeated year by year. The evaluation of the results is summarized in Table 6. The number of data used for ANFIS model training has little effect on the regression coefficient of the training data (Table 6). In testing, however, the regression results of the model trained on the 10-year data set are very similar to those of the model trained on the 33-year data set (Figure 7). The error values show that the 8-year data set is sufficient for training (Figure 8). In conclusion, a 10-year data set may be sufficient for ANFIS model training.
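The year-by-year procedure for the training length can be sketched as a simple loop: train on the first k years of monthly records, evaluate on a fixed test set, and repeat for increasing k. The sketch below is illustrative only. A least-squares linear model stands in for the ANFIS model so that the example stays self-contained, and the function names, the use of the leading records for the shorter training sets, and the two evaluation measures are all assumptions rather than details taken from the study.

```python
import numpy as np

def fit_linear(X, y):
    """Least-squares stand-in for model training (NOT the ANFIS model of the study)."""
    design = np.column_stack([np.ones(len(X)), X])
    coeffs, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coeffs

def predict_linear(coeffs, X):
    return coeffs[0] + X @ coeffs[1:]

def training_length_sweep(X_train, y_train, X_test, y_test, max_years, months_per_year=12):
    """Return (years, test correlation, test RMSE) for each training length."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    X_test = np.asarray(X_test, dtype=float)
    y_test = np.asarray(y_test, dtype=float)
    results = []
    for years in range(1, max_years + 1):
        n = years * months_per_year          # number of monthly records used
        coeffs = fit_linear(X_train[:n], y_train[:n])
        estimated = predict_linear(coeffs, X_test)
        r = float(np.corrcoef(y_test, estimated)[0, 1])
        err = float(np.sqrt(np.mean((y_test - estimated) ** 2)))
        results.append((years, r, err))
    return results

# Synthetic, noise-free data with the record counts of the third data set (504, 405).
rng = np.random.default_rng(0)
X_all = rng.uniform(1.0, 100.0, size=(504, 2))
y_all = 2.0 + 0.5 * X_all[:, 0] + 1.5 * X_all[:, 1]
sweep = training_length_sweep(X_all[:405], y_all[:405], X_all[405:], y_all[405:], max_years=33)
```

Plotting the second and third entries of each tuple against the number of years would give curves of the kind shown in Figures 7 and 8. With real flow data and a nonlinear model, the point where those curves flatten indicates the training length beyond which additional years add little.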

Conclusions
It is not always possible to collect the long, coordinated data records needed for the optimal design of water resources projects. The aim of this study was to develop ANFIS models for stations on Yeşilırmak River in order to solve this problem and improve the existing methods. Another aim of the study was to investigate how the number of input parameters and the amount of data affect ANFIS models. For this purpose, 3 different data sets were analyzed.
The results show that, besides classical and regression models, ANFIS models can be used to complete missing flow data. ANFIS models yield very accurate results, especially when the number of input parameters is small. However, multiple regression models yield better results than ANFIS models when the number of input parameters is large. In addition, ANFIS models take longer to produce results as the number of input parameters increases. Lastly, a data set of at least 10 years is required for a reliable ANFIS model training phase.
In conclusion, ANFIS modelling yields accurate results and can therefore be used to complete missing data when the number of input parameters is small and the data set covers at least 10 years.

Figure 1. Site location map of Yeşilırmak Basin

Figure 4. Comparison of observed data and estimated data of the ANFIS (5-5) model for the test data of the first data set

Figure 5. Comparison of observed data and estimated data of the ANFIS (5-5) model for the test data of the second data set

Figure 6. Comparison of observed data and estimated data of the ANFIS (5-5) model for the test data of the third data set

Figure 7. Regression values for the number of data used for ANFIS model training

Table 1. Statistical analysis of data from stations

Table 2. Correlation between data from stations

Table 3. Training and testing data results of models developed for the first data set

Table 4. Training and testing data results of models developed for the second data set

Table 6. Regression and error values for the number of data used for ANFIS model training