A review of recurrent neural network architectures for sequence learning: Comparison between LSTM and GRU

Deep neural networks (DNNs) have made a major impact on machine learning by achieving human-level performance on real-world problems such as image processing and natural language processing (NLP). The convolutional neural network (CNN) and the recurrent neural network (RNN) are two typical architectures widely used to solve such problems. Time sequence-dependent problems are generally very challenging, and RNN architectures have driven large improvements across a wide range of machine learning problems involving sequential input. In this paper, different types of RNN architectures are compared, with a special focus on two well-known gated RNNs: Long Short-Term Memory (LSTM) and the Gated Recurrent Unit (GRU). We evaluated these models on the task of force estimation during pouring. Four different models (multi-layer LSTM, multi-layer GRU, single-layer LSTM and single-layer GRU) were created and trained. The results suggest that the multi-layer GRU outperformed the other three models.

Keywords—Recurrent neural network, Long Short-Term Memory, Gated Recurrent Unit


Introduction
Nowadays, the recurrent neural network (RNN) has gained a lot of attention due to its outstanding performance on real-world machine learning problems, especially those involving sequential data and inputs and outputs of different lengths [1][2][3][4][5][6][7][8][9][10][11][12]. Alex Graves, in "Supervised Sequence Labelling with Recurrent Neural Networks", shows that RNNs are very powerful sequence learners [13]. One of the most popular application areas of RNNs is natural language processing, for example machine translation [14]. Other areas include handwriting recognition and generation [15], speech recognition [16], and human activity modeling [17].
In the past two decades, robots have taken on roles in many areas such as medical centers [18], the military [19] and industry [20]. Scientists have been trying to design and build robots capable of smooth and natural movements to perform tasks normally done by human beings. One activity that is useful in many situations is pouring, which humans perform in daily life almost unconsciously.
To properly estimate the amount transferred during the pouring process, humans rely on two sources of feedback: vision and force. Y. Huang et al. introduced an RNN model based on LSTM to simulate the pouring activity [21].
In this work, the main focus is to compare different RNN architectures for this problem. We used the number of trainable parameters as a similarity factor to generate four different models: a 3-layer LSTM, a 3-layer GRU, a 1-layer LSTM and a 1-layer GRU. The efficiency of LSTM and GRU is compared under the same conditions. When trained in the same environment with the same settings, such as number of epochs, loss function, optimizer and batch size, GRU produces better results than LSTM. This paper is organized as follows. Section II gives a brief background on recurrent neural networks and elaborates on the differences between LSTM and GRU. Section III describes the dataset and its preprocessing. Section IV discusses the architectures and the training process. Section V presents the evaluation results and concludes the paper.

Background
In this section, we describe the two recurrent units, LSTM and GRU.

Long Short-Term Memory (LSTM)
LSTM was introduced in 1997 by Hochreiter & Schmidhuber to mitigate the vanishing gradient problem that earlier recurrent networks encountered [21][22]. Instead of vanishing or exploding, backpropagated errors in an LSTM can flow back through an effectively unlimited number of time steps.

LSTM Unit
Figure 1: Diagram of a one-unit Long Short-Term Memory (LSTM)

LSTM uses a memory cell, which is why it can solve problems that require remembering events that happened many time steps earlier. Figure 1 illustrates one LSTM unit.
At each step, an LSTM unit can choose to read, write or reset its cell. Equation 1 shows the complete update process of an LSTM unit [23].
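For reference, in the standard formulation, an LSTM unit with input gate i_t, forget gate f_t, output gate o_t, cell state c_t and hidden state h_t is updated as follows (the notation may differ slightly from the cited Equation 1):

\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}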

Gated Recurrent Unit (GRU)
GRU was introduced in 2014 by K. Cho et al. [24] as an alternative that alleviates the complexity of LSTM units. GRU has fewer trainable parameters, mainly because it lacks the output gate that LSTM has. Figure 2 illustrates one GRU unit, and Equation 2 shows the complete update process of a GRU unit [23]. The main differences from LSTM are:
• GRU has two gates (reset and update), while LSTM has three gates (input, output and forget). The reset gate in GRU controls how new input is combined with the previous memory, and the update gate controls how much of the previous state to keep. The update gate does the work that the input and forget gates do in LSTM.
• Unlike an LSTM unit, a GRU unit does not maintain a separate cell memory c_t.
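In the standard formulation, a GRU unit with update gate z_t, reset gate r_t and hidden state h_t is updated as follows (again, the notation may differ slightly from the cited Equation 2):

\begin{aligned}
z_t &= \sigma(W_z x_t + U_z h_{t-1} + b_z) \\
r_t &= \sigma(W_r x_t + U_r h_{t-1} + b_r) \\
\tilde{h}_t &= \tanh(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h) \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
\end{aligned}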
GRU is therefore very similar to LSTM, and its lower complexity and smaller number of parameters make it an interesting architecture to compare against LSTM in the pouring scenario.

Dataset and Data Preparation
In this study, a dataset that includes 1307 motion sequences and their corresponding weight measurements was used. The dataset has the shape [# sequences, max_length, # features]. The maximum sequence length in this dataset is 1099, and any sequence shorter than 1099 was padded with zeros at the end [25].
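As a minimal sketch of this padding step (raw_sequences is a hypothetical list of per-sequence NumPy arrays; it is not part of the released dataset description):

import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

# raw_sequences: hypothetical list of arrays, each of shape (seq_len, n_features)
MAX_LEN = 1099  # maximum sequence length reported for this dataset

# Pad every sequence with zeros at the end ("post") so all sequences share the
# same length, yielding an array of shape (n_sequences, 1099, n_features).
padded = pad_sequences(raw_sequences, maxlen=MAX_LEN, dtype="float32",
                       padding="post", value=0.0)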
This dataset has 10 features for each time step, including:
• d_rc: diameter of the receiving cup (mm)
• h_rc: height of the receiving cup (mm)
• d_pc: diameter of the pouring cup (mm)
• h_pc: height of the pouring cup (mm)
• ρ: material density / water density (unitless)
Figure 4 illustrates a sequence from the dataset. Only the rotation angle and the weight change with time; each feature has a different range. The goal of this work is to predict the weight at time t (the target/response) using the 9 other features (predictors). The rest of this section describes how the features were preprocessed for the neural network input.

Standardization
The original dataset was split into train, validation and test sets with ratios of 0.7, 0.2 and 0.1, respectively. The next step is dividing each of these sets into input and target; in this process, the weight f_t is extracted from the train/validation and test datasets as the target. The final shapes of these datasets are:
Train_Validation_Input = (1176, 1099, 9)
Train_Validation_Output = (1176, 1099, 1)
Test_Input = (131, 1099, 9)
Test_Output = (131, 1099, 1)
Standardizing a dataset means rescaling it so that all input features lie in the same range, with a mean of 0 and a standard deviation of 1. In this paper, a standard scaler was used to scale the dataset's features. Figure 4 illustrates a time sequence after applying the scaler. The test dataset was scaled using the statistics computed during the scaling of the training data.
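A minimal sketch of this scaling step, assuming the arrays are named as above (lower-cased) and ignoring the subtlety that zero-padded time steps would ideally be excluded when fitting the scaler:

import numpy as np
from sklearn.preprocessing import StandardScaler

n_train, max_len, n_feat = train_validation_input.shape

# Fit the scaler on the training/validation inputs only (flattened to 2-D),
# so the test set never influences the scaling statistics.
scaler = StandardScaler()
train_scaled = scaler.fit_transform(
    train_validation_input.reshape(-1, n_feat)).reshape(n_train, max_len, n_feat)

# Apply the same training-derived mean and standard deviation to the test set.
n_test = test_input.shape[0]
test_scaled = scaler.transform(
    test_input.reshape(-1, n_feat)).reshape(n_test, max_len, n_feat)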

Model
Our goal in this paper was not to compete with previous work [21]; instead, we studied the performance of LSTM and GRU networks on this dataset. For this purpose, we chose four different models and compared their performance on this problem.

Designing Model
The number of trainable parameters was chosen as the key similarity factor between models, so all four models have approximately the same number of parameters to make them comparable. The dataset includes zero padding, and it is important to ignore those values during training because they are not real measurements and are only there to fill sequences up to the maximum length. Keras [26] provides a "Masking" layer that prevents those padded time steps from contributing to training. In this paper, all of the models include one Masking layer to ignore the padding values. Between the recurrent layers, a dropout layer was added to help the model train better. Table 2 shows the parameters of these models.

Multi LSTM Model
This model includes a masking layer with a mask value of 0 (the dataset contains zero padding) and an input shape of (None, 9); the first layer must specify the input shape, and in this case the input has nine features. The first LSTM layer has 55 hidden units, the second has 27 and the third has 16. Between these layers there is a dropout layer with a rate of 0.2. At the end, the model uses three fully connected layers with sizes 64, 29 and 1 (the output layer), with the ReLU activation function.
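A sketch of how this model could be assembled with the Keras Sequential API is shown below; the layer sizes follow the description above, while details such as returning sequences from every recurrent layer (needed for per-time-step weight prediction) are assumptions:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Masking, LSTM, Dropout, Dense

model = Sequential([
    # Ignore zero-padded time steps so they do not affect training.
    Masking(mask_value=0.0, input_shape=(None, 9)),
    # Three stacked LSTM layers; return_sequences=True keeps one output per time step.
    LSTM(55, return_sequences=True),
    Dropout(0.2),
    LSTM(27, return_sequences=True),
    Dropout(0.2),
    LSTM(16, return_sequences=True),
    Dropout(0.2),
    # Fully connected layers applied at every time step.
    Dense(64, activation="relu"),
    Dense(29, activation="relu"),
    Dense(1, activation="relu"),  # predicted weight at each time step
])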

Multi GRU Model
Like the first model, this model also starts with a masking layer, followed by three GRU layers with 64, 32 and 16 hidden units. It also has three fully connected layers of sizes 64, 32 and 1 (the output), with the same activation function.

Single LSTM Model
The difference between this model and the multi-layer LSTM is that this model includes only one LSTM layer with 80 hidden units and one dense output layer of size 1.

Single GRU Model
This model likewise includes one GRU layer with 93 hidden units and a dense output layer of size 1.

Train parameters
The loss function is one of the two mandatory arguments required to compile a model, and picking a good loss function can improve the model's final result. The mean squared error (MSE) loss function was chosen for this study. MSE is a measure of the quality of a predictor; it is non-negative, and values closer to zero are better. The other important training parameter is the optimizer, which also plays a large part in the training process. Adam [27] was chosen as the optimizer because it tends to converge faster than other optimizers, so it can learn better within a small number of epochs.
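A minimal sketch of how the models could be compiled and trained in Keras follows; the learning rate of 0.01 matches the value given in the next paragraph, the epoch count of 45 corresponds to the second experiment discussed below, and the batch size and validation split are assumptions, as their exact values are not specified here:

from tensorflow.keras.optimizers import Adam

# Compile with the MSE loss and the Adam optimizer (learning rate 0.01).
model.compile(optimizer=Adam(learning_rate=0.01), loss="mse")

# Train on the standardized inputs; batch size and validation split are illustrative.
history = model.fit(train_scaled, train_validation_output,
                    validation_split=0.2, epochs=45, batch_size=32)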
Optimizers have a learning rate parameter that influences the learning process; all of the models in this paper used a learning rate of 0.01.

Results and Conclusion
Table 1 shows that the multi-layer GRU network performs better than the other models with a small number of training steps. Increasing the number of training steps generally allows a model to learn better and produce a better result, so for the second part of the test the number of epochs was increased to 45 while all other parameters were kept the same. Table 3 shows the results of this second test: increasing the number of epochs made a large difference in the result of the multi-layer GRU. Both GRU models (single- and multi-layer) produced better predictions when the number of epochs was increased, whereas the LSTM models made less accurate predictions. Table ?? illustrates the performance increase/decrease across these two test runs.

In conclusion, this study showed that GRU worked better than LSTM for this specific problem. The results also show that multiple GRU layers performed better than a single layer, and that GRU takes less time to train. In the future, a better architecture could be used to generate results competitive with the existing architecture [21] and to quantify the performance improvement. We would suggest building the same architecture introduced in the "Learning to pour" paper, but with GRU gates, and using the same environment to obtain better performance on this problem.