A Memristor-Based Cascaded Neural Network for Specific Target Recognition

Multiply-accumulate calculation using a memristor crossbar array is an important method to realize neuromorphic computing. However, memristor array fabrication technology is still immature, and it is difficult to fabricate large-scale arrays with high yield, which restricts the development of memristor-based neuromorphic computing technology. Cascading small-scale arrays to achieve the neuromorphic computational ability of large-scale arrays is therefore of great significance for promoting the application of memristor-based neuromorphic computing. To address this issue, we present a memristor-based cascaded framework built from basic computation units; several neural network processing units can be cascaded in this way to improve the processing capability on a dataset. In addition, we introduce a split method to reduce the pressure on the input terminals. Compared with VGGNet and GoogLeNet, the proposed cascaded framework achieves 93.54% Fashion-MNIST accuracy with 4.15M parameters. Extensive experiments with the Ti/AlOx/TaOx/Pt devices we fabricated show that the circuit simulation still provides high recognition accuracy, and the recognition accuracy loss after circuit simulation can be controlled at around 0.26%.

use of limited memristor resources and make the system work at high speed in real-time processing applications, we present a memristor-based cascaded framework built from neuromorphic processing chips. That means several neural network processing chips can be cascaded to improve the processing capability on the dataset. The basic computation unit in this work builds on our prior work developing a memristor-based CNN architecture [20], which validated that a three-layer CNN with the Abs activation function can achieve the desired recognition accuracy. The rest of this paper is organized as follows. Section II presents the cascaded method based on the basic computation unit and a split method, including the circuits implemented with memristor crossbar arrays. Section III exhibits the experimental results. Section IV concludes the paper.

To gain a better understanding of the cascaded method, we first introduce the basic units that make up the cascaded network, and then present a detailed description of our cascaded CNN framework. The simplified CNN includes three layers. The convolution layer includes k kernels of size Ks × Ks followed by the absolute-value nonlinearity (Abs), which extracts features from the input images and produces the feature maps. The average-pooling layer obtains spatial invariance while reducing the spatial dimensions of the feature maps. Aiming to combine several monolithic networks (BCUs) to obtain better performance, we propose a cascaded CNN network, whose specific design can be seen in Figure 2a.
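The BCU pipeline described above (convolution with k kernels, Abs activation, then average pooling) can be sketched as follows. This is an illustrative numpy sketch, not the paper's implementation; the function name `bcu_forward` and the loop-based convolution are our own choices.

```python
import numpy as np

def bcu_forward(image, kernels, pool=2):
    """Forward pass of a basic computation unit (BCU):
    convolution -> Abs activation -> average pooling.
    `kernels` has shape (k, Ks, Ks); `image` is a 2-D array."""
    k, Ks, _ = kernels.shape
    H, W = image.shape
    oh, ow = H - Ks + 1, W - Ks + 1
    fmaps = np.empty((k, oh, ow))
    for i in range(k):                 # one feature map per kernel
        for y in range(oh):
            for x in range(ow):
                fmaps[i, y, x] = np.sum(image[y:y+Ks, x:x+Ks] * kernels[i])
    fmaps = np.abs(fmaps)              # Abs nonlinearity
    ph, pw = oh // pool, ow // pool    # average pooling over pool x pool blocks
    pooled = (fmaps[:, :ph*pool, :pw*pool]
              .reshape(k, ph, pool, pw, pool)
              .mean(axis=(2, 4)))
    return pooled
```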

The cascaded framework includes three parts. Given the output G ∈ R^(C×W×H) generated by a part, a reconstruction transformation f : R^(C×W×H) → R^(C×W×H) is applied to aggregate the outputs over all BCUs of that part, where C is the number of channels of the input image and W and H are the spatial dimensions. The output F_1 of Part #1 is aggregated from the BCU outputs G_1_n ∈ R^(C_1×W_1×H_1) (n ∈ [1, M]). The next Part #2 includes N BCUs, which take the outputs of the previous part as inputs to produce F_2 ∈ R^(C_2×W_2×H_2). The final Part #3 includes P BCUs: the first BCU takes F_2 as input to generate G_3_1 ∈ R^(C_3×W_3×H_3), the second BCU uses G_3_1 to produce G_3_2, and so on until G_3_P ∈ R^(C_3×W_3×H_3) is produced. This cascaded mode is called the "M-N-P" type.
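The "M-N-P" dataflow can be sketched in a few lines. Aggregating parallel BCU outputs by summation is an assumption on our part (the excerpt does not specify the reconstruction transformation f); each BCU is modeled as a callable.

```python
def cascade(image, parts):
    """Run an "M-N-P" cascade. `parts` is a 3-tuple of BCU lists.
    Outputs of parallel BCUs are aggregated by summation -- one plausible
    choice for the reconstruction transformation f (an assumption here);
    the final part chains its BCUs in series."""
    m_units, n_units, p_units = parts
    f1 = sum(bcu(image) for bcu in m_units)   # Part #1: M BCUs in parallel
    f2 = sum(bcu(f1) for bcu in n_units)      # Part #2: N BCUs in parallel
    g = f2
    for bcu in p_units:                       # Part #3: P BCUs in series
        g = bcu(g)
    return g
```

With toy BCUs the "3-1-1" type, for instance, becomes `cascade(x, ([b1, b2, b3], [b4], [b5]))`.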

The typical "3-1-1" cascaded framework is shown in Figure 2b; it includes five BCUs, three in parallel and two in series, to produce the classification output. Based on the BCU architecture, the computation unit can be treated as an image transformer.

According to Equation (1), the output G can be described as a multiply-addition calculation, so it can be performed by several memristors.

As mentioned above, the BCU is a simplified CNN architecture. After network training is finished, the weight matrix contains some negative weights, so a conversion mapping method needs to be applied. We assume that W represents a 2×2 weight matrix that includes both positive and negative weights. The BCU can produce entire output feature maps in one processing cycle. It no longer needs to wait for one output feature map to be completed before the next operation can be executed. This outcome means that no data storage device is required between the network layers. A memristor-based computation unit architecture is shown in Figure 4.
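Since a single memristor conductance cannot be negative, a common mapping scheme (a sketch of one standard approach, not necessarily the paper's exact circuit) splits W into two non-negative conductance matrices whose column currents are subtracted:

```python
import numpy as np

def split_weights(W):
    """Map a weight matrix with mixed signs onto two non-negative
    conductance matrices such that W = G_pos - G_neg (illustrative)."""
    G_pos = np.maximum(W, 0.0)   # positive weights -> first crossbar
    G_neg = np.maximum(-W, 0.0)  # negative weights -> second crossbar
    return G_pos, G_neg

W = np.array([[0.5, -0.2],
              [-0.7, 0.3]])
G_pos, G_neg = split_weights(W)
v = np.array([1.0, 2.0])             # input voltages
i_out = G_pos.T @ v - G_neg.T @ v    # subtract the two column currents
assert np.allclose(i_out, W.T @ v)   # matches the ideal multiply-accumulate
```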

Abs activation function in Equation (4). It should be noted that we use Out_11 to cascade to the next network, while Out_12 is treated as the output when the single BCU is used for classification.

To enhance the driving capability of the crossbar arrays, driver circuits are needed so that the low output current can drive the subsequent stage. As we discussed in the previous section, the BCU can generate entire output feature maps in one processing cycle. This means the BCU is a highly parallel system, which improves the time efficiency of classification. However, this calculation method puts pressure on the input terminals. Assuming a W × H input image, each image is converted to a W·H × 1 voltage vector by the DAC as the input dataflow. Each Ks × Ks field of the image is sent to the K_convX_i module for the convolution computation.

K_convX_i denotes the kernel crossbar discussed above when the i-th convolution kernel is used to complete the X-th convolution computation. A Ks × Ks convolution over a W × H image requires a total of (W − Ks + 1) × (H − Ks + 1) operations. The modules of each convolution kernel are independent, so K_convX_i does not need to be reconfigured.
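The operation count above can be checked with a small sketch: each Ks × Ks receptive field is flattened into a voltage vector and multiplied against the flattened kernel, modeling one crossbar multiply-accumulate per field (the 28×28 image and 9×9 kernel sizes follow the experiments later in the paper; the loop structure is illustrative).

```python
import numpy as np

W, H, Ks = 28, 28, 9
n_ops = (W - Ks + 1) * (H - Ks + 1)       # operations per kernel

rng = np.random.default_rng(0)
g = rng.random((Ks, Ks)).flatten()        # kernel as a conductance column
image = rng.random((H, W))

outputs = []
for y in range(H - Ks + 1):
    for x in range(W - Ks + 1):
        v = image[y:y+Ks, x:x+Ks].flatten()  # receptive-field voltages
        outputs.append(v @ g)                # one crossbar multiply-accumulate
```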

To reduce the pressure on the input terminals, a BCU split method needs to be considered. According to Equation (1), the input image I* is sent to the BCU to complete the convolutional calculation. The input I* can be decomposed as I* = I*_1 + I*_2 + ... + I*_N, so Equation (1) can be rewritten accordingly, where H : R^(C×W×H) → R^(C*×W*×H*) is the transformation in the BCU and ⊕ represents the accumulation operation. For example, such a chip performs the convolution calculation K_convX_i on its portion of I*.

As mentioned above, the fully connected layer computes W · G* (here G* can be regarded as R^(1×m_N)), and chip #i maps its output to positions m_(i−1)+1 ∼ m_i in G*. The several R^(C×W×H) outputs generated by these chips are accumulated by the ⊕ operation.
have the same format as the MNIST dataset: 28×28 greyscale images with corresponding labels l ∈ [0, 9]. The Fashion-MNIST dataset is a step up from the MNIST dataset. It remains simple enough that complex architectures, learning algorithms, and models are not needed to view progress and soundly measure performance.

The weights of the neural network are converted to conductance values of the memristor devices by Equation (6). C_max and C_min indicate the maximum and minimum conductances, respectively, W is the original weight set, W_max represents the maximum absolute value of the weight set, and C_i is the conductance of the memristor crossbar array.
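Equation (6) itself is not reproduced in this excerpt; one plausible form consistent with the quantities named above is a linear scaling of each weight magnitude into the device conductance window (the exact formula and the use of the 1 kΩ–12 kΩ range are assumptions here):

```python
import numpy as np

def weights_to_conductance(W, C_min=1/12e3, C_max=1/1e3):
    """Illustrative conversion: scale |W_i| linearly into [C_min, C_max],
    with the bounds taken from the 1 kOhm - 12 kOhm device range."""
    W_max = np.max(np.abs(W))          # maximum absolute weight
    return C_min + (np.abs(W) / W_max) * (C_max - C_min)

W = np.array([[0.5, -1.0],
              [0.25, 0.0]])
C = weights_to_conductance(W)
```

The sign of each weight is handled separately by the positive/negative crossbar mapping discussed earlier.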
The 1T1R memristor crossbar array which stores the network weights is used for simulation. The fabricated Ti/AlOx/TaOx/Pt device has a multilevel characteristic and can be repeatedly programmed to different target resistance states from 1 kΩ to 12 kΩ, indicating great potential for neuromorphic computing. The networks are evaluated on the MNIST and Fashion-MNIST datasets; the configuration details and performance are shown in Table 1.

The "4-2-2" cascaded type slightly outperforms "4-2-1" (∼0.13%) with 10,000 more parameters. It can be seen that the cascaded models achieve 93.54% accuracy with 4.15M parameters. The plots in Figure 7 show the potential output of the BCU obtained by the HSPICE simulation and the software simulation output (using Python) on 10,000 test samples. Figures 7c and 7f depict boxplots of the simulation difference for the outputs of the kernels and the fully connected layer. As shown in Figure 7c, the differences are concentrated in the range of −5×10^−6 to 10×10^−6. In Figure 7f, the differences are concentrated in the range of −5×10^−3 to 5×10^−3. The differences for the fully connected layer are thus about two orders of magnitude larger than those for the kernels.

Simulation experiments demonstrate that the output of the circuit is close to the software simulation, indicating that the circuit implementation is feasible.
Figure 8 illustrates the relationship between the number of model parameters and Fashion-MNIST recognition accuracy. The number of convolution kernels in the BCU ranges from 5 to 32; a 9×9 kernel size and 2×2 average pooling are applied. As shown in the figure, the BCU achieves 89.60% accuracy with 0.5M parameters (when the number of convolution kernels is 32). We tested the five cascaded types for their performance. It can be observed that as the number of parameters increases, the recognition accuracy of the network gradually increases. The "4-3-2" cascaded type provides 93.12% accuracy with 4.15M parameters. It is worth noting that the accuracy of the BCU is better than the "1-1-1" cascaded type