Comparison of Backpropagation and Kalman Filter-based Training for Neural Networks

Abstract: This work describes and compares the backpropagation algorithm with the extended Kalman filter (EKF), a second-order training method which can be applied to the problem of learning neural network parameters and is known to converge in only a few iterations. The algorithms are compared with respect to their effectiveness and speed of convergence using simulated data for both a regression and a classification task.


Introduction
Neural networks (NN) have been successfully used in noise filtering and state estimation tasks [1]. Neural networks use the backpropagation algorithm [2] in order to update the parameters such that the difference between the prediction of the network and the observed data is minimized [3]. However, such neural networks typically require a large number of training iterations before the parameters converge.

A feedforward neural network takes as input a feature vector x ∈ R^D, where D is the dimensionality of the feature space describing an observation. The first layer computes the weighted inputs

    z^{(1)}_j = Σ_{i=1}^{D} w^{(1)}_{ji} x_i + b^{(1)}_j,

where j denotes the unit, often also referred to as neuron, of layer (1) and w^{(1)}_{ji} is a learnable weight which passes from the i-th unit of the previous layer (the i-th input feature in this case) to the j-th neuron of the current layer [11]. Moreover, b^{(1)}_j is referred to as the bias. In order to introduce nonlinearity to the network and thus be able to model all kinds of functions, the weighted inputs z^{(1)}_j are transformed using a nonlinear activation function σ(·), which yields the activation or output of a neuron:

    a^{(1)}_j = σ(z^{(1)}_j).

The output of this layer is then passed to the next layer, leading to a function of the following form:

    z^{(2)}_k = Σ_j w^{(2)}_{kj} a^{(1)}_j + b^{(2)}_k,

where k denotes the neuron of the second layer of the network. Note that the output activations of the preceding layer, a^{(1)}_j, serve as input for this layer.

For ease of notation, these functions are typically defined in matrix notation, such that a^{(l)} refers to the output vector of the l-th layer, w^{(l)} is the matrix of weights and b^{(l)} is the vector of biases. Using this notation, the previously described network of 3 layers can be described by the set of functions a^{(l)} = σ(w^{(l)} a^{(l−1)} + b^{(l)}), 1 ≤ l ≤ 3, with a^{(0)} = x [11]. The last layer is the output layer, which outputs the value of ŷ. In between are the hidden layers, which are called hidden because the training data does not show the desired output for each of these layers [2].
Hence, a typical 3-layer feedforward neural network would look as follows (the input layer is not counted as a layer):

    ŷ = f(x; θ) = σ(w^{(3)} σ(w^{(2)} σ(w^{(1)} x + b^{(1)}) + b^{(2)}) + b^{(3)}).

During training of the neural network one seeks to find the parameters w^{(l)} and b^{(l)} that make the network best approximate the true function f*: x → y. To guide this learning behavior the network requires a loss function that quantifies errors in the prediction process. A typical loss function is the quadratic loss:

    C(θ) = 1/(2N) Σ_{n=1}^{N} (f(x_n; θ) − y_n)^2,    (2.7)

where N denotes the total number of training observations, θ is the collection of parameters of the model and f(·) describes the composition of functions in the neural network [11].
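To make the layer-by-layer computation concrete, the forward pass a^{(l)} = σ(w^{(l)} a^{(l−1)} + b^{(l)}) can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation; the logistic sigmoid is assumed as the activation function σ(·) and the layer sizes are arbitrary.

```python
import numpy as np

def sigma(z):
    """Logistic sigmoid, one common choice for the activation sigma(.)."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """Propagate input x through the layers: a^(l) = sigma(w^(l) a^(l-1) + b^(l))."""
    a = x
    for w, b in zip(weights, biases):
        z = w @ a + b   # weighted input z^(l)
        a = sigma(z)    # activation a^(l)
    return a

# A 3-layer network on a D = 2 dimensional input (2 -> 4 -> 3 -> 1 units).
rng = np.random.default_rng(0)
sizes = [2, 4, 3, 1]
weights = [rng.normal(size=(n_out, n_in)) for n_in, n_out in zip(sizes, sizes[1:])]
biases = [np.zeros(n_out) for n_out in sizes[1:]]
y_hat = forward(np.array([0.5, -0.2]), weights, biases)
```

Each weight matrix has shape (units in current layer, units in previous layer), so the matrix-vector product implements the sum over incoming connections for every neuron at once.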

In order to adjust the weights and biases, i.e. the parameters of the model, so that the loss function C arrives at a minimum, the derivatives of the loss function with respect to the weights and biases are needed. To compute these derivatives one can make use of the chain rule of calculus, which states that for a function composition f(g(x)) the derivative of f(·) w.r.t. x can be written as ∂f(g(x))/∂g(x) × ∂g(x)/∂x [2]. For the cost function of a neural network, which is a function composition of the many functions building up the hidden and the output layers, the derivative with respect to the weights can be computed as follows [12]:

    ∂C/∂w^{(l)}_{ji} = ∂C/∂z^{(l)}_j × ∂z^{(l)}_j/∂w^{(l)}_{ji}.

Since z^{(l)}_j is the weighted sum of the inputs plus a bias term, the derivative of it with respect to the weight w^{(l)}_{ji} is the input a^{(l−1)}_i. For the derivative of the cost with respect to the weighted input z^{(l)}_j, the following notation is introduced:

    δ^{(l)}_j ≡ ∂C/∂z^{(l)}_j,

where δ^{(l)}_j is usually referred to as the error, since for the output units this term simplifies to the difference between the true value y and its estimate ŷ, as we will see shortly [13]. In order to compute δ^{(l)}_j, again the chain rule can be used:

    δ^{(l)}_j = Σ_k ∂C/∂z^{(l+1)}_k × ∂z^{(l+1)}_k/∂z^{(l)}_j = Σ_k δ^{(l+1)}_k w^{(l+1)}_{kj} σ'(z^{(l)}_j).    (3.7)

Now, the error of a layer (l) depends on the error of the succeeding layer (l + 1), raising the question of how to compute the error of the output layer L, for which no succeeding layer exists. Here, the chain rule yields

    δ^{(L)}_j = Σ_k ∂C/∂a^{(L)}_k × ∂a^{(L)}_k/∂z^{(L)}_j,

where the sum goes over all neurons k in the output layer [13]. Since the output activation a^{(L)}_k depends on the weighted input z^{(L)}_j only for k = j, all other terms of the sum vanish.

Hence, this term can be written as σ'(z^{(L)}_j), which yields:

    δ^{(L)}_j = ∂C/∂a^{(L)}_j × σ'(z^{(L)}_j).

Assuming a quadratic loss function as described in (2.7), the derivative of C w.r.t. the network's output a^{(L)}_j ≡ ŷ_j is equal to (ŷ_j − y_j), hence the name error for the δ_j terms [11].
Again, these equations can be rewritten using the matrix notation which was introduced in section 2. The error of the output layer then becomes

    δ^{(L)} = ∇_ŷ C ⊙ σ'(z^{(L)}),    (3.11)

where ∇_ŷ C is the gradient of the cost function with respect to the output of the network and ⊙ is the pairwise multiplication operator, usually referred to as the Hadamard product [13]. For all other layers, function (3.7) changes to:

    δ^{(l)} = ((w^{(l+1)})^T δ^{(l+1)}) ⊙ σ'(z^{(l)}).

Given the errors it is easy to compute the derivatives of the cost function with respect to the weights and biases. As shown above, the error term δ^{(l)} must be multiplied with the activations of the preceding layer:

    ∂C/∂w^{(l)} = δ^{(l)} (a^{(l−1)})^T,    ∂C/∂b^{(l)} = δ^{(l)}.

Given a set of weights and input features, the backpropagation algorithm starts with calculating the weighted inputs z^{(l)} and activations a^{(l)} iteratively for each layer in ascending order by using forward propagation (see Algorithm 1). Then, the error of the output layer is computed using formula (3.11). Given δ^{(L)}, all other errors can be computed by iterating through all layers in reversed order and applying formula (3.7).

Finally, the derivatives of the cost function with respect to the weights and biases can be returned by multiplying the error terms with the input of the respective layer and with a vector of ones, respectively [12].
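The full procedure — forward pass, output-layer error, backward recursion, and the weight and bias derivatives — can be sketched as follows. This is a minimal single-example sketch assuming sigmoid activations and the quadratic loss, not the authors' implementation.

```python
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigma_prime(z):
    s = sigma(z)
    return s * (1.0 - s)

def backprop(x, y, weights, biases):
    """Gradients of the quadratic loss w.r.t. weights and biases for one example."""
    # Forward pass: store the weighted inputs z^(l) and activations a^(l).
    a, activations, zs = x, [x], []
    for w, b in zip(weights, biases):
        z = w @ a + b
        zs.append(z)
        a = sigma(z)
        activations.append(a)
    # Output-layer error: delta^(L) = (y_hat - y) * sigma'(z^(L)).
    delta = (activations[-1] - y) * sigma_prime(zs[-1])
    grads_w = [None] * len(weights)
    grads_b = [None] * len(weights)
    grads_w[-1] = np.outer(delta, activations[-2])
    grads_b[-1] = delta
    # Propagate the error backwards through the hidden layers.
    for l in range(len(weights) - 2, -1, -1):
        delta = (weights[l + 1].T @ delta) * sigma_prime(zs[l])
        grads_w[l] = np.outer(delta, activations[l])
        grads_b[l] = delta
    return grads_w, grads_b
```

Given the gradients, one gradient-descent update is simply θ ← θ − α ∂C/∂θ, e.g. `weights = [w - alpha * g for w, g in zip(weights, grads_w)]`.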

Given a set of randomly initialized weights and biases, the network's weights can be learned by an iterative optimization algorithm in combination with backpropagation.

Typically, the cost function is minimized using stochastic gradient descent or a variant thereof. Starting from randomly initialized weights and biases, the training data x are randomly permuted, for each observation i = 1, …, N the gradients are computed via backpropagation and the parameters are updated, and this is repeated until a stopping criterion is reached; the algorithm then yields the learned weights and biases ŵ^{(l)}, b̂^{(l)}, 1 ≤ l ≤ L.

The extended Kalman filter considers a nonlinear dynamic system that describes the evolution of a state together with the observations or measurements:

    w_{k+1} = f(w_k, x_k) + ω_k,
    y_k = h(w_k, x_k) + ν_k,

where w_k is the state of the system at time-step k, x_k is an input of forces controlling the system and ω_k and ν_k are the process and observation noises respectively [15]. Both noise terms are assumed to be zero-mean multivariate Gaussian noises with covariances Q_k and R_k respectively. The nonlinear function f(·) relates the state at the current time step k to the next time step k + 1 using the additional information x_k about the process. Likewise, h(·) relates the state w_k to the observation y_k [16]. The goal of the extended Kalman filter is to find an estimate ŵ_{k+1} of w_{k+1} given the observations or measurements {y_j}_{0≤j≤k} [6].

One can show that this estimate can be obtained by the recursion

    K_k = P_k H_k^T (H_k P_k H_k^T + R_k)^{−1},
    ŵ_{k+1} = ŵ_k + K_k (y_k − h(x_k, ŵ_k)),
    P_{k+1} = P_k − K_k H_k P_k + Q_k,

where K_k is referred to as the Kalman gain, which specifies how much weight should be given to the current prediction error, P_k is the error covariance of the state estimate and H_k is the Jacobian of h(·) evaluated at ŵ_k. To apply this filter to neural network training, the parameters of the network are treated as the state that should then be estimated [6]. Therefore, we let all weights and biases of the network be collected in the state vector w_k. According to [9], a neural network's behavior can then be described by the following nonlinear discrete-time system:

    w_{k+1} = w_k + ω_k,
    y_k = h(w_k, x_k) + ν_k,

i.e. the weights follow a stationary process driven only by the process noise, while the network function h(·) maps the weights and the input x_k to the observed target y_k. It follows that the weights of the neural network can be estimated using the EKF.
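One EKF parameter update under this stationary-weight model can be sketched as follows. This is a schematic sketch, not the authors' implementation: `H` is the Jacobian of the network output with respect to the flattened weight vector (in practice obtained via backpropagation), and the function and variable names are illustrative.

```python
import numpy as np

def ekf_step(w, P, H, y, y_hat, R, Q):
    """One EKF update for the system w_{k+1} = w_k + omega_k, y_k = h(w_k, x_k) + nu_k.

    w:     flattened weight vector, shape (n,)
    P:     state error covariance, shape (n, n)
    H:     Jacobian of h w.r.t. w at the current estimate, shape (m, n)
    y:     observed target, shape (m,)
    y_hat: network prediction h(x_k, w_k), shape (m,)
    R, Q:  measurement and process noise covariances
    """
    S = H @ P @ H.T + R                  # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)       # Kalman gain K_k
    w_new = w + K @ (y - y_hat)          # state (weight) update
    P_new = P - K @ H @ P + Q            # error covariance update
    return w_new, P_new
```

Note that a large P relative to R yields a gain close to one, so early updates move the weights almost all the way towards explaining the current observation; as P shrinks, updates become more conservative.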

In contrast to the standard stochastic gradient descent with backpropagation algorithm, whose only hyperparameter is the learning rate α, the EKF algorithm relies on the parameters R_k, P_k and Q_k.

[10] note that the measurement noise R_k is equivalent to the inverse of the learning rate α of gradient descent, and as with the learning rate it is difficult to choose a good value a priori [17]. Hence, this parameter must be tuned, using for example a search over candidate values evaluated on a test data set [18]. However, in scenarios of signal processing, where data arrive sequentially, such offline tuning is not feasible, which motivates a recursive estimate of R_k within the EKF training procedure (Algorithm 4: EKF-Training, Function KALMANTRAINING(x, P_0, R_0, Q_0): randomly initialize the weights and biases, then for each training pattern with label y_k and input vector x_k perform the filter update). The update equation for R_k is then defined as a forgetting-factor recursion over the squared residuals ξ_k ξ_k^T, where η is a forgetting factor that is used to average the sum of squared residuals over time and thus approximate its expected value. [19] recommend setting this parameter to a value of 0.3. For the initial measurement noise R_0 this work follows [10], who recommend setting the learning rate α to a small value of around 0.01, which corresponds to an R_0 of 100, due to the inverse relation.
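A sketch of such a forgetting-factor recursion is given below. The exact equation is not reproduced in the text, so the form used here — an exponentially weighted average R_k = η R_{k−1} + (1 − η) ξ_k ξ_k^T — is an assumption that matches the description (forgetting factor η averaging the squared residuals over time) and should be checked against [19].

```python
import numpy as np

def update_measurement_noise(R_prev, residual, eta=0.3):
    """Recursive estimate R_k = eta * R_{k-1} + (1 - eta) * xi_k xi_k^T,
    an exponentially weighted average of the squared residuals xi_k."""
    xi = np.atleast_1d(residual)
    return eta * R_prev + (1.0 - eta) * np.outer(xi, xi)

# Starting from R_0 = 100 (the value recommended for alpha ~ 0.01):
R = np.array([[100.0]])
R = update_measurement_noise(R, np.array([0.5]))
```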

The process noise Q_k is an important factor as it can help to prevent the error covariance matrix P_k from converging towards zero, which would imply a Kalman gain of zero and thus that no learning is taking place anymore [20]. [19] therefore derived an adaptive estimation of Q_k that uses the innovation ξ_k = (y_k − h(x_k, ŵ_k)). The authors show that the process noise covariance can be updated using a recursion in which again η acts as a forgetting factor. The initial value Q_0 is usually set to qI, with q being a small value. In this work, q is chosen to be 10^{−2} to initialize the process noise.
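The adaptive process-noise update can be sketched analogously. The exact equation is again not reproduced in the text; the form below, Q_k = η Q_{k−1} + (1 − η) K_k ξ_k ξ_k^T K_k^T, which projects the innovation into weight space through the Kalman gain, is one form found in the adaptive-EKF literature and is used here as an assumption to be verified against [19].

```python
import numpy as np

def update_process_noise(Q_prev, K, innovation, eta=0.3):
    """Recursive estimate Q_k = eta * Q_{k-1} + (1 - eta) * K xi xi^T K^T,
    where K is the Kalman gain and xi the innovation y_k - h(x_k, w_k)."""
    xi = np.atleast_1d(innovation)
    v = K @ xi                                   # innovation in weight space
    return eta * Q_prev + (1.0 - eta) * np.outer(v, v)

# Initialization Q_0 = q * I with q = 1e-2, as chosen in this work:
n = 3
Q = 1e-2 * np.eye(n)
K = np.ones((n, 1))
Q = update_process_noise(Q, K, np.array([0.1]))
```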

Since the state covariance matrix P_k is already iteratively updated by the extended Kalman filter, its initial value is not as important as the former two [21]. However, [10] recommend setting it to ε^{−1} I, where ε is a small number from the range 0.001–0.01 and I is the identity matrix. The authors state that setting P_0 to a diagonal matrix reflects the fact that the weights are initialized randomly and without correlation to each other.

Moreover, due to the random initialization, the diagonal entries are set to rather high values to account for the resulting uncertainty associated with the initial state ŵ_0.
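The recommended initialization can be written directly as a one-liner, with ε in the suggested range:

```python
import numpy as np

def init_state_covariance(n_weights, eps=0.01):
    """P_0 = (1/eps) * I with eps in [0.001, 0.01]: a diagonal matrix with
    large entries, reflecting uncorrelated, highly uncertain initial weights."""
    return (1.0 / eps) * np.eye(n_weights)

P0 = init_state_covariance(4)  # diagonal entries of 1/0.01 = 100
```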

In fact, for neural networks, the initial state ŵ_0 should generally be drawn independently from a uniform or normal distribution. [22], for example, recommend drawing the weights w^{(l)} independently from a normal distribution with zero mean and standard deviation equal to √(2/n^{(l−1)}), with n^{(l−1)} being the number of neurons in layer (l − 1).
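This initialization scheme (commonly known as He initialization) can be sketched as:

```python
import numpy as np

def he_init(n_prev, n_curr, rng=None):
    """Weights for a layer l: zero-mean normal entries with standard deviation
    sqrt(2 / n_prev), where n_prev is the number of neurons in layer l-1."""
    if rng is None:
        rng = np.random.default_rng()
    return rng.normal(0.0, np.sqrt(2.0 / n_prev), size=(n_curr, n_prev))

w1 = he_init(n_prev=256, n_curr=128, rng=np.random.default_rng(0))
```

Scaling the standard deviation with the fan-in keeps the variance of the weighted inputs roughly constant from layer to layer at initialization.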

To evaluate the computational efficiency and predictive effectiveness of the extended Kalman filter for training neural networks, both the EKF and the standard backpropagation (SBP) algorithm are used to fit neural networks on different data sets.

The network architecture regarding the hidden layers is the same for every experiment. Next to the regression problem, the classical XOR ("exclusive or") classification problem is considered. Given two binary input features x_1 and x_2, the XOR function outputs 1 if exactly one of the two inputs equals 1, and 0 otherwise.

The quality of the fit resulting from the EKF algorithm can also be attributed to the recursive updates of the measurement noise R_k and the process noise Q_k. The EKF implicitly assumes that a quadratic loss is minimized, and this loss has also been used to produce the results seen in figure 4. Indeed, when changing the loss of SBP to the mean squared error, the EKF even produces a smaller error than the SBP. However, the implicit assumption of a quadratic loss presents a major flaw of the EKF algorithm for training neural networks, as also noted by [10].

In summary, the EKF is an attractive alternative to gradient-based training if labeled data are scarce or the data generating process is highly non-stationary [9]. Moreover, the recursive update strategy of the noise parameters R_k and Q_k makes the model more robust against improper initial values of these parameters, which especially in noisy scenarios has led to better predictive performance. Future research should now investigate how the EKF method can also be used to learn the parameters of neural networks with loss functions other than the mean squared error.