2. Related Works
Image classification is the most prominent application of the convolutional neural network (CNN), although CNNs are also used for image retrieval, object detection, feature extraction, image recognition, etc.; therefore, the related works described in this section focus on the association of CNN and image classification. Recall that a standard CNN consists of two connected networks, a convolutional network and a dense network, where the convolutional network is a sequence of convolutional layers, each produced by small filter kernels, and the dense network is a feedforward network (FFN), that is, a normal artificial neural network (ANN); the dense network is also called fully connected network (FCN). A special kind of convolutional layer is the pooling layer, whose kernel selects a specific value within the range of the kernel instead of computing a weighted sum as usual. If the selected value is the maximum value within the range of the kernel, the pooling layer is called a max-pooling layer. Pooling layers play an important role in CNN because the pooling kernel aims to extract dominant pixels in the image. By convention, a convolutional layer with filtering kernels (filters) is called a c-layer and a pooling layer with a pooling kernel is called a p-layer, whereas a traditional layer with parametric weights in the dense network is simply called a w-layer.
CNN-based classification methods are grouped into five approaches (Chen, et al., 2021): 1) the traditional CNN-based approach, called classic approach, conforms strictly to the definition of CNN with two networks, the convolutional network (convolutional layers + pooling layers) and the dense network; 2) the Inception-based approach follows the concept of Inception; 3) the residual approach follows the concept of residual network, which aims to solve the degradation problem of very deep networks associated with vanishing gradients; 4) the attention-based approach applies the attention mechanism to CNN classifiers; 5) the transformer-based approach, the most recent one at the time of writing (2025), applies the transformer developed for large language models (LLM) to image classification, noting that the transformer builds on both the attention mechanism and the pre-training model, where the pre-training model was originally developed for the domain of computer vision.
In 1998, LeCun et al. (LeCun, Bottou, Bengio, & Haffner, 1998) developed the LeNet model, a pioneering work that established the fundamental principles of the classic approach for CNN classifiers. The LeNet model, also called LeNet-5, consists of two successive networks, a convolutional network and a dense network, where the convolutional network has 4 convolutional layers and the dense network has 3 w-layers, which means that LeNet-5 has 7 layers. In particular, each c-layer is a set of sub-convolutional layers (called sub-c-layers), where every sub-c-layer results from one filter. The 4 convolutional layers comprise 2 c-layers and 2 p-layers in the following order (Chen, et al., 2021, p. 10): the first 28x28x6 c-layer, denoted conv1, results from 6 5x5 filtering kernels (stride 1, no padding) applied to every 32x32x1 MNIST image → the second 14x14x6 p-layer, denoted maxpool1, results from a 2x2 pooling kernel (stride 2, no padding) applied to the first c-layer → the third 10x10x16 c-layer, denoted conv2, results from 16 5x5x6 filtering kernels (stride 1, no padding) applied to the second p-layer → the fourth 5x5x16 p-layer, denoted maxpool2, results from a 2x2 pooling kernel (stride 2, no padding) applied to the third c-layer.
Table 2.1.
Convolutional network of LeNet-5.
| Layer | Input Dimension | Filter Dimension | Filter Count | Output Dimension |
|---|---|---|---|---|
| conv1 | 32 x 32 | 5 x 5 | 6 | 28 x 28 x 6 |
| maxpool1 | 28 x 28 x 6 | 2 x 2 | | 14 x 14 x 6 |
| conv2 | 14 x 14 x 6 | 5 x 5 x 6 | 16 | 10 x 10 x 16 |
| maxpool2 | 10 x 10 x 16 | 2 x 2 | | 5 x 5 x 16 |
| conv3 | 5 x 5 x 16 | 5 x 5 x 16 | 120 | 1 x 1 x 120 |
The fourth p-layer maxpool2, whose dimension is 5x5x16, is filtered again by 120 5x5x16 filtering kernels (stride 1, no padding) so that it is flattened into a 120-element vector, denoted conv3, which is the first w-layer of the dense network, whereas the second w-layer is an 84-element vector and the third w-layer is a 10-element vector representing 10 classes. Counting conv3 together with the convolutional network, there are 6 5x5 filtering kernels, 16 5x5x6 filtering kernels, and 120 5x5x16 filtering kernels, which require 50,550 = 6*5*5 + 16*5*5*6 + 120*5*5*16 parameters (filter biases excluded). Among the 3 w-layers of the dense network, the 120x84 weights and the 84x10 weights require 11,014 = 120*84 + 84 (bias) + 84*10 + 10 (bias) parameters. In total, LeNet-5 needs 61,564 = 50,550 + 11,014 parameters. As described here, the pooling layers of LeNet-5 are max-pooling layers and the activation function is the sigmoid function (the original paper used average-style subsampling layers and a scaled hyperbolic tangent). Moreover, LeNet-5 is a pioneering CNN classifier that applied the backpropagation algorithm to training convolutional layers, whereas earlier applications often fixed the filters of c-layers, such as blurring filters and edge-detection filters.
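The parameter arithmetic above can be checked with a few lines of Python. This is only a minimal sketch: the helper names `conv_params` and `dense_params` are hypothetical, and filter biases are left out of the convolutional count, matching the convention used in the text.

```python
def conv_params(count, height, width, depth):
    """Weights of `count` filters of size height x width x depth (no bias)."""
    return count * height * width * depth


def dense_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


# conv1, conv2, conv3 of LeNet-5 as listed in Table 2.1
conv_total = (conv_params(6, 5, 5, 1)
              + conv_params(16, 5, 5, 6)
              + conv_params(120, 5, 5, 16))

# 120 -> 84 -> 10 dense layers
dense_total = dense_params(120, 84) + dense_params(84, 10)

print(conv_total, dense_total, conv_total + dense_total)  # 50550 11014 61564
```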
More than ten years later, in 2012, Krizhevsky et al. (Krizhevsky, Sutskever, & Hinton, 2012) proposed the AlexNet model, which gave better results in CNN classification. AlexNet consists of two successive networks, a convolutional network and a dense network, where the convolutional network has 8 convolutional layers and the dense network has 3 w-layers, which means that AlexNet has 11 layers. Because the architecture of AlexNet does not go far beyond LeNet-5, it still belongs to the classic approach in CNN classification. However, AlexNet makes several significant improvements over LeNet-5 (Chen, et al., 2021, p. 10): the ReLU activation function, and the dropout technique, data augmentation, and overlapping pooling, all of which alleviate the overfitting problem. The AlexNet architecture is more complicated than LeNet-5, with 5 c-layers and 3 p-layers among the 8 convolutional layers, in the following order (Chen, et al., 2021, p. 11): the first 55x55x96 c-layer, denoted conv1, results from 96 11x11x3 filtering kernels (stride 4, no padding) applied to every 227x227x3 image → the second 27x27x96 p-layer, denoted maxpool1, results from a 3x3 pooling kernel (stride 2, no padding) applied to the first c-layer → the third 27x27x256 c-layer, denoted conv2, results from 256 5x5x96 filtering kernels (stride 1, padding 2) applied to the second p-layer → the fourth 13x13x256 p-layer, denoted maxpool2, results from a 3x3 pooling kernel (stride 2, no padding) applied to the third c-layer → the fifth 13x13x384 c-layer, denoted conv3, results from 384 3x3x256 filtering kernels (stride 1, padding 1) applied to the fourth p-layer → the sixth 13x13x384 c-layer, denoted conv4, results from 384 3x3x384 filtering kernels (stride 1, padding 1) applied to the fifth c-layer → the seventh 13x13x256 c-layer, denoted conv5, results from 256 3x3x384 filtering kernels (stride 1, padding 1) applied to the sixth c-layer → the eighth 6x6x256 p-layer, denoted maxpool3, results from a 3x3 pooling kernel (stride 2, no padding) applied to the seventh c-layer.
Table 2.2.
Convolutional network of AlexNet.
| Layer | Input Dimension | Filter Dimension | Filter Count | Output Dimension |
|---|---|---|---|---|
| conv1 | 227 x 227 x 3 | 11 x 11 x 3 (stride 4) | 96 | 55 x 55 x 96 |
| maxpool1 | 55 x 55 x 96 | 3 x 3 (stride 2) | | 27 x 27 x 96 |
| conv2 | 27 x 27 x 96 | 5 x 5 x 96 (stride 1) | 256 | 27 x 27 x 256 |
| maxpool2 | 27 x 27 x 256 | 3 x 3 (stride 2) | | 13 x 13 x 256 |
| conv3 | 13 x 13 x 256 | 3 x 3 x 256 (stride 1) | 384 | 13 x 13 x 384 |
| conv4 | 13 x 13 x 384 | 3 x 3 x 384 (stride 1) | 384 | 13 x 13 x 384 |
| conv5 | 13 x 13 x 384 | 3 x 3 x 384 (stride 1) | 256 | 13 x 13 x 256 |
| maxpool3 | 13 x 13 x 256 | 3 x 3 (stride 2) | | 6 x 6 x 256 |
The eighth p-layer maxpool3, whose dimension is 6x6x256 (= 1x1x9216), is flattened into a 9216-element vector as input of the dense network, so that the first w-layer of the dense network is a 4096-element vector, the second w-layer is also a 4096-element vector, and the third w-layer is a 1000-element vector corresponding to 1000 classes. Note, the pooling kernel in AlexNet is a max-pooling kernel. Within the convolutional network, there are 96 11x11x3 filtering kernels, 256 5x5x96 filtering kernels, 384 3x3x256 filtering kernels, 384 3x3x384 filtering kernels, and 256 3x3x384 filtering kernels, which require 3,745,824 = 96*11*11*3 + 256*5*5*96 + 384*3*3*256 + 384*3*3*384 + 256*3*3*384 parameters. Within the dense network, the 9216x4096 weights, the 4096x4096 weights, and the 4096x1000 weights require 58,631,144 = 9216*4096 + 4096 (bias) + 4096*4096 + 4096 (bias) + 4096*1000 + 1000 (bias) parameters. In total, AlexNet needs at least 62,376,968 = 3,745,824 + 58,631,144 parameters.
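The output dimensions listed for AlexNet follow from the usual size rule, floor((n + 2p - k) / s) + 1, for an n x n input, a k x k kernel, stride s, and padding p. The sketch below traces that rule through the eight layers; the function names are hypothetical, and a padding of 2 is assumed for conv2 because that is what a 5x5 kernel with stride 1 needs to keep the 27x27 size.

```python
def out_size(n, k, stride=1, padding=0):
    """Spatial output size of a k x k kernel applied to an n x n input."""
    return (n + 2 * padding - k) // stride + 1


# (name, kernel size, stride, padding) for the AlexNet convolutional network
ALEXNET_LAYERS = [
    ("conv1", 11, 4, 0),
    ("maxpool1", 3, 2, 0),
    ("conv2", 5, 1, 2),   # padding 2 is an assumption needed for 27 -> 27
    ("maxpool2", 3, 2, 0),
    ("conv3", 3, 1, 1),
    ("conv4", 3, 1, 1),
    ("conv5", 3, 1, 1),
    ("maxpool3", 3, 2, 0),
]


def trace(n, layers):
    """Return the spatial size after every layer, starting from an n x n image."""
    sizes = []
    for name, k, s, p in layers:
        n = out_size(n, k, s, p)
        sizes.append((name, n))
    return sizes


sizes = trace(227, ALEXNET_LAYERS)
for name, n in sizes:
    print(name, n)
```

The same helper reproduces the LeNet-5 sizes, for example `out_size(32, 5)` for conv1.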
The VGG model developed by Simonyan and Zisserman (Simonyan & Zisserman, 2015), which belongs to the classic approach, aims to define a well-structured model with a modular architecture. Precisely, the VGG architecture still includes a convolutional network and a dense network, but its convolutional network is modularized into five blocks and each block has some c-layers and one p-layer (Gemini 2025). The spatial output dimension (width and height) is halved block after block, which is called dimensionality reduction, whereas the number of filters (filtering kernels) is doubled block after block until it reaches 512, which is called filter progression (Gemini 2025). Moreover, the number of c-layers per block grows from two in the first two blocks to three in the last three blocks, noting that the size of filtering kernels is fixed at 3x3 and the size of pooling kernels is fixed at 2x2, whereas the stride of filtering kernels is fixed at 1 and the stride of pooling kernels is fixed at 2 (Gemini 2025). For instance, the first block has two c-layers and one p-layer, where the first 224x224x64 c-layer, denoted conv1_1, results from 64 3x3x3 filtering kernels (stride 1, padding 1) applied to every 224x224x3 image and the second 224x224x64 c-layer, denoted conv1_2, results from 64 3x3x64 filtering kernels (stride 1, padding 1) applied to the first c-layer conv1_1, whereas the third 112x112x64 p-layer, denoted maxpool1, results from a 2x2 max-pooling kernel (stride 2) applied to the second c-layer conv1_2. Therefore, the output dimension of the first block is 112x112x64 and the first block requires at least 38,592 = 64*3*3*3 (conv1_1) + 64*3*3*64 (conv1_2) parameters.
Next, the second block, which connects directly to the first block, has two c-layers and one p-layer, where the first 112x112x128 c-layer, denoted conv2_1, results from 128 3x3x64 filtering kernels (stride 1, padding 1) applied to the 112x112x64 output of the first block and the second 112x112x128 c-layer, denoted conv2_2, results from 128 3x3x128 filtering kernels (stride 1, padding 1) applied to the first c-layer conv2_1, whereas the third 56x56x128 p-layer, denoted maxpool2, results from a 2x2 max-pooling kernel (stride 2) applied to the second c-layer conv2_2. Therefore, the output dimension of the second block is 56x56x128 and the second block requires at least 221,184 = 128*3*3*64 (conv2_1) + 128*3*3*128 (conv2_2) parameters. Next, the third block, which connects directly to the second block, has three c-layers and one p-layer, where the first 56x56x256 c-layer, denoted conv3_1, results from 256 3x3x128 filtering kernels (stride 1, padding 1) applied to the 56x56x128 output of the second block, the second 56x56x256 c-layer, denoted conv3_2, results from 256 3x3x256 filtering kernels (stride 1, padding 1) applied to the first c-layer conv3_1, and the third 56x56x256 c-layer, denoted conv3_3, results from 256 3x3x256 filtering kernels (stride 1, padding 1) applied to the second c-layer conv3_2, whereas the fourth 28x28x256 p-layer, denoted maxpool3, results from a 2x2 max-pooling kernel (stride 2) applied to the third c-layer conv3_3. Therefore, the output dimension of the third block is 28x28x256 and the third block requires at least 1,474,560 = 256*3*3*128 (conv3_1) + 256*3*3*256 (conv3_2) + 256*3*3*256 (conv3_3) parameters.
Next, the fourth block, which connects directly to the third block, has three c-layers and one p-layer, where the first 28x28x512 c-layer, denoted conv4_1, results from 512 3x3x256 filtering kernels (stride 1, padding 1) applied to the 28x28x256 output of the third block, the second 28x28x512 c-layer, denoted conv4_2, results from 512 3x3x512 filtering kernels (stride 1, padding 1) applied to the first c-layer conv4_1, and the third 28x28x512 c-layer, denoted conv4_3, results from 512 3x3x512 filtering kernels (stride 1, padding 1) applied to the second c-layer conv4_2, whereas the fourth 14x14x512 p-layer, denoted maxpool4, results from a 2x2 max-pooling kernel (stride 2) applied to the third c-layer conv4_3. Therefore, the output dimension of the fourth block is 14x14x512 and the fourth block requires at least 5,898,240 = 512*3*3*256 (conv4_1) + 512*3*3*512 (conv4_2) + 512*3*3*512 (conv4_3) parameters. Finally, the fifth block, which connects directly to the fourth block, has three c-layers and one p-layer, where the first 14x14x512 c-layer, denoted conv5_1, results from 512 3x3x512 filtering kernels (stride 1, padding 1) applied to the 14x14x512 output of the fourth block, the second 14x14x512 c-layer, denoted conv5_2, results from 512 3x3x512 filtering kernels (stride 1, padding 1) applied to the first c-layer conv5_1, and the third 14x14x512 c-layer, denoted conv5_3, results from 512 3x3x512 filtering kernels (stride 1, padding 1) applied to the second c-layer conv5_2, whereas the fourth 7x7x512 p-layer, denoted maxpool5, results from a 2x2 max-pooling kernel (stride 2) applied to the third c-layer conv5_3. Therefore, the output dimension of the fifth block is 7x7x512 and the fifth block requires at least 7,077,888 = 512*3*3*512 (conv5_1) + 512*3*3*512 (conv5_2) + 512*3*3*512 (conv5_3) parameters.
Table 2.3.
Five blocks (convolutional network) of VGG.
| Block | Layer | Input Dimension | Filter Dimension | Filter Count | Output Dimension |
|---|---|---|---|---|---|
| 1 | conv1_1 | 224 x 224 x 3 | 3 x 3 x 3 | 64 | 224 x 224 x 64 |
| 1 | conv1_2 | 224 x 224 x 64 | 3 x 3 x 64 | 64 | 224 x 224 x 64 |
| 1 | maxpool1 | 224 x 224 x 64 | 2 x 2 | | 112 x 112 x 64 |
| 2 | conv2_1 | 112 x 112 x 64 | 3 x 3 x 64 | 128 | 112 x 112 x 128 |
| 2 | conv2_2 | 112 x 112 x 128 | 3 x 3 x 128 | 128 | 112 x 112 x 128 |
| 2 | maxpool2 | 112 x 112 x 128 | 2 x 2 | | 56 x 56 x 128 |
| 3 | conv3_1 | 56 x 56 x 128 | 3 x 3 x 128 | 256 | 56 x 56 x 256 |
| 3 | conv3_2 | 56 x 56 x 256 | 3 x 3 x 256 | 256 | 56 x 56 x 256 |
| 3 | conv3_3 | 56 x 56 x 256 | 3 x 3 x 256 | 256 | 56 x 56 x 256 |
| 3 | maxpool3 | 56 x 56 x 256 | 2 x 2 | | 28 x 28 x 256 |
| 4 | conv4_1 | 28 x 28 x 256 | 3 x 3 x 256 | 512 | 28 x 28 x 512 |
| 4 | conv4_2 | 28 x 28 x 512 | 3 x 3 x 512 | 512 | 28 x 28 x 512 |
| 4 | conv4_3 | 28 x 28 x 512 | 3 x 3 x 512 | 512 | 28 x 28 x 512 |
| 4 | maxpool4 | 28 x 28 x 512 | 2 x 2 | | 14 x 14 x 512 |
| 5 | conv5_1 | 14 x 14 x 512 | 3 x 3 x 512 | 512 | 14 x 14 x 512 |
| 5 | conv5_2 | 14 x 14 x 512 | 3 x 3 x 512 | 512 | 14 x 14 x 512 |
| 5 | conv5_3 | 14 x 14 x 512 | 3 x 3 x 512 | 512 | 14 x 14 x 512 |
| 5 | maxpool5 | 14 x 14 x 512 | 2 x 2 | | 7 x 7 x 512 |
Consequently, the output of the fifth and final block, whose dimension is 7x7x512, is flattened into 25,088 = 7*7*512 features, that is, a 25,088-element vector that is the input of a dense network consisting of three w-layers. The first, second, and third w-layers of the dense network are a 4096-element vector, a 4096-element vector, and a 1000-element vector (corresponding to 1000 image classes), respectively. In the five blocks of the convolutional network, there are 14,710,464 = 38,592 + 221,184 + 1,474,560 + 5,898,240 + 7,077,888 parameters. Among the three w-layers of the dense network, the 25,088x4096 weights, the 4096x4096 weights, and the 4096x1000 weights require 123,642,856 = 25,088*4096 + 4096 (bias) + 4096*4096 + 4096 (bias) + 4096*1000 + 1000 (bias) parameters. In total, VGG needs at least 138,353,320 = 14,710,464 + 123,642,856 parameters. Although the number of parameters of VGG is much larger than that of LeNet-5 and AlexNet, the VGG model can be easily customized because of its flexible architecture with modular blocks. The VGG model described here is also called VGG-16 because the total number of c-layers over all blocks is 13 (conv1_1, conv1_2, conv2_1, conv2_2, conv3_1, conv3_2, conv3_3, conv4_1, conv4_2, conv4_3, conv5_1, conv5_2, conv5_3), which together with 3 w-layers gives 16 weighted layers (Gemini 2025). Another version of the VGG model is VGG-19, which includes 16 c-layers plus 3 w-layers, giving 19 weighted layers (Gemini 2025).
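Because the VGG blocks are regular, the whole parameter count can be derived from a short block configuration. The following is a minimal sketch under the same convention as the text (filter biases excluded, dense biases included); the names `BLOCKS`, `vgg_conv_params`, and `dense_params` are hypothetical.

```python
# (number of 3x3 c-layers, number of filters) for each of the five blocks
BLOCKS = [(2, 64), (2, 128), (3, 256), (3, 512), (3, 512)]


def vgg_conv_params(blocks, in_depth=3):
    """Sum the 3x3 filter weights over all c-layers (no bias terms)."""
    total = 0
    for n_layers, n_filters in blocks:
        for _ in range(n_layers):
            total += n_filters * 3 * 3 * in_depth
            in_depth = n_filters  # the next layer sees this many channels
    return total


def dense_params(sizes):
    """Weights plus biases for consecutive fully connected layers."""
    return sum(a * b + b for a, b in zip(sizes, sizes[1:]))


conv_total = vgg_conv_params(BLOCKS)
dense_total = dense_params([7 * 7 * 512, 4096, 4096, 1000])

print(conv_total, dense_total, conv_total + dense_total)
# 14710464 123642856 138353320
```

Changing `BLOCKS` to `[(2, 64), (2, 128), (4, 256), (4, 512), (4, 512)]` gives the 16 c-layers of the VGG-19 variant mentioned above.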
The network-in-network (NiN) model proposed by Lin et al. (Lin, Chen, & Yan, 2014), the last generation of the classic approach, is a significant methodology in CNN-based classification because NiN significantly decreases the complexity of the linear filter operator over multiple channels, even though the filter operator itself already decreases the complexity of weight multiplication in the feed-forward network (FFN). Particularly, the NiN filtering kernel is a 1x1 window including only one weight w, so that the 1x1 NiN filtering operator on an element $x_{ij}$ at row i and column j is defined as follows:

$$y_{ij} = w\,x_{ij}$$

For multi-channel input data, where $x_{ijk}$ is the element at the kth channel and $w_k$ is the weight of that channel, the NiN filtering kernel is generalized as follows:

$$y_{ij} = \sum_{k} w_k\,x_{ijk}$$
Given any two successive convolutional layers X(s) and X(s+1), there is a sub-network between the two c-layers, called multilayer perceptron convolutional network (mlpconv), which includes some sub-layers called micro-layers or micro-convolutional layers, so that mlpconv is also called micro-network. Suppose an mlpconv has n micro-layers Y(1), Y(2),…, Y(n) whose input and output are the c-layer X(s) and the c-layer X(s+1), respectively; these micro-layers are determined by the NiN filtering operator as follows (Lin, Chen, & Yan, 2014, pp. 3-4), (Gemini 2025):

$$y^{(1)}_{ij} = \max\left(\sum_{k} u_k\,x^{(s)}_{ijk} + \alpha,\; 0\right)$$

$$y^{(r)}_{ij} = \max\left(v_r\,y^{(r-1)}_{ij} + \beta_r,\; 0\right), \quad r = 2, \ldots, n$$

$$x^{(s+1)}_{ij1} = y^{(n)}_{ij}$$

where $x^{(s)}_{ijk}$ denotes an element of the input c-layer X(s) at its kth channel and $x^{(s+1)}_{ij1}$ denotes an element of the output c-layer X(s+1) at its first channel, whereas $y^{(r)}_{ij}$ denotes an element of the rth micro-layer Y(r). Elements of the output c-layer X(s+1) at other channels are calculated in the same way as the first-channel elements $x^{(s+1)}_{ij1}$. Note, u and v are parametric weights whereas α and β are parametric biases. The pooling kernel in NiN is a max-pooling kernel. The technique by which NiN outputs the maximum value after a series of filtering operators:

$$x^{(s+1)}_{ij1} = \max\left(v_n\,y^{(n-1)}_{ij} + \beta_n,\; 0\right)$$

is called cross max-pooling. Notably, NiN does not apply a dense network to classify the features from mlpconv; instead, NiN applies filtering operators and cross max-pooling sequentially, so that each output feature map of the last mlpconv is averaged over its spatial positions to produce an aggregated output whose elements correspond to the image classes, and then the soft-max function and the cross-entropy loss function are applied to this aggregated output to perform the classification task (Chen, et al., 2021, p. 12). This technique is called global average pooling (GAP).
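To make the 1x1 filtering operator and global average pooling concrete, the sketch below applies one 1x1 filter across the channels of a tiny input and then averages the resulting feature map. It is only an illustration: the function names are hypothetical, the rectification max(·, 0) is assumed as the activation, and the input values and weights are made up.

```python
def conv1x1(x, weights, bias=0.0):
    """Apply one 1x1 filter to an H x W x C nested list, followed by max(., 0)."""
    return [[max(sum(w * v for w, v in zip(weights, pixel)) + bias, 0.0)
             for pixel in row]
            for row in x]


def global_average_pool(feature_map):
    """Average all spatial positions of one H x W feature map into a scalar."""
    values = [v for row in feature_map for v in row]
    return sum(values) / len(values)


# A made-up 2 x 2 input with 2 channels
x = [[[1.0, 2.0], [3.0, 4.0]],
     [[5.0, 6.0], [7.0, 8.0]]]

fmap = conv1x1(x, weights=[0.5, -0.25])
print(fmap)                       # [[0.0, 0.5], [1.0, 1.5]]
print(global_average_pool(fmap))  # 0.75
```

In a classifier, one such averaged value per feature map would be produced, one per class, before the soft-max function is applied.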
GoogLeNet, also called Inception V1, developed by Szegedy et al. (Szegedy, et al., 2015), advanced CNN-based classifiers considerably by proposing the concept of the Inception model, where Inception inherits both the modular mechanism of VGG and the micro-filtering operator of NiN, so that every Inception module has some convolutional layers which include: 1) normal filtering kernels with matrix kernels, 2) 1x1 micro-filtering kernels with one weight, 3) max-pooling kernels, and 4) global average pooling (GAP). The most important aspect of Inception is that these convolutional layers operate in parallel, so that the final layer, called filter concatenation layer, concatenates the outputs of such convolutional layers in depth according to a predefined number of channels. Finally, the concatenated output can be fed into a dense network for classification; however, GAP, which produces an aggregated output, can be an alternative to the dense network. Therefore, the Inception model establishes the basic principles of the Inception-based approach to CNN classification. From practical improvements, there is an important recognition that any CNN-based framework needs sparsity in its structure in order to converge and to group classes better; thus, Inception (GoogLeNet) and NiN aim to obtain such sparsity by 1x1 micro-filtering and GAP (Chen, et al., 2021, p. 14). Moreover, Inception further improves the structural sparsity by parallel-processing convolutional layers; besides, GAP and cross max-pooling decrease the data dimension and the number of parameters. Indeed, Inception requires fewer parameters than classic approaches like AlexNet and VGG but reaches better performance. Particularly, an Inception module has two layers and four branches, in which the two layers are the structural viewpoint in depth and the four branches are the structural viewpoint in width.
For instance, the first branch includes a 1x1 micro-filtering kernel (belonging to the first layer) followed by a 5x5 filtering kernel (belonging to the second layer). The second branch includes a 1x1 micro-filtering kernel (belonging to the first layer) followed by a 3x3 filtering kernel (belonging to the second layer). The third branch includes a 3x3 max-pooling kernel (belonging to the first layer) followed by a 1x1 micro-filtering kernel (belonging to the second layer). The fourth branch includes only a 1x1 micro-filtering kernel (belonging to the first layer). All of these kernels have stride 1, which keeps the size (width and height) of the data intact so that a so-called depth-wise concatenation can merge the outputs of the branches into one layer. For instance, let B1, B2, B3, and B4, whose dimensions are w x h x d1, w x h x d2, w x h x d3, and w x h x d4, be the outputs of the four branches, where d1, d2, d3, and d4 are the depths (also the numbers of kernels) of the four branches; then the depth-wise concatenation, also called filter concatenation, merges B1, B2, B3, and B4 into one output B whose dimension is w x h x (d1+d2+d3+d4). Based on the definition of the Inception module, the first version of the Inception framework, called Inception V1 and known as GoogLeNet, has three parts (Gemini 2025): the stem part (part A), the stacked Inception modules part (part B), and the classification head (part C). Part A has three layers, where the first layer is a convolutional layer with a 7x7 filtering kernel (stride 2), the second layer, for dimensionality reduction, is a micro-convolutional layer with a 1x1 micro-filtering kernel, and the third layer is a convolutional layer with a 3x3 filtering kernel. Part B consists of 9 of the aforementioned Inception modules, so part B has 18 layers because every Inception module has 2 layers. Part C is the dense network for classification, with only one layer of 1000 neurons (classes); thus, GoogLeNet has 22 layers.
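The depth-wise (filter) concatenation described above can be sketched in plain Python on nested lists. The function names `depth_concat` and `filled` are hypothetical, and the branch outputs are filled with constant values purely for illustration.

```python
def depth_concat(branches):
    """Concatenate branch outputs of shape H x W x d_i along the depth axis."""
    height, width = len(branches[0]), len(branches[0][0])
    for b in branches:
        if len(b) != height or any(len(row) != width for row in b):
            raise ValueError("all branches must share width and height")
    return [[[v for b in branches for v in b[i][j]]
             for j in range(width)]
            for i in range(height)]


def filled(h, w, d, value):
    """Build an h x w x d nested list holding a constant value."""
    return [[[value] * d for _ in range(w)] for _ in range(h)]


# Four branch outputs with depths d1=1, d2=2, d3=3, d4=1 on a 2 x 2 plane
b1, b2 = filled(2, 2, 1, 1.0), filled(2, 2, 2, 2.0)
b3, b4 = filled(2, 2, 3, 3.0), filled(2, 2, 1, 4.0)

out = depth_concat([b1, b2, b3, b4])
print(len(out), len(out[0]), len(out[0][0]))  # 2 2 7
```

The width and height are untouched and only the depth grows to d1+d2+d3+d4, which is why all kernels inside the module must preserve the spatial size.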
A deep neural network (DNN) with many hidden layers can obtain better evaluation with high convergence, but there is a side effect: too many hidden layers cause a problem in which the DNN becomes tied to very few of the output neurons, which significantly decreases the accuracy of learning. This problem, which resembles the overfitting problem that occurs frequently in machine learning, is known as the degradation problem in neural networks, although He et al. (He, Zhang, Ren, & Sun, 2016) observed that it also raises the training error and is therefore not simply overfitting. A commonly cited cause is that the gradient of the activation function becomes too small, nearly zero, after a chain of gradient calculations by the stochastic gradient descent (SGD) algorithm and the backpropagation algorithm, which is known as the vanishing gradient problem. Two popular techniques in this context are the dropout (dilution) technique, which mainly counters overfitting, and the residual connection (RC) technique, which eases the flow of gradients. Kaiming He et al. (He, Zhang, Ren, & Sun, 2016) proposed the RC technique and applied it to training convolutional filtering kernels (weights) so as to produce the so-called ResNet for CNN-based classification. The significant effectiveness of ResNet established a CNN-based image classification approach known as the residual approach. In general, RC can be applied to any artificial neural network (ANN), not only CNN. For instance, given a deep neural network denoted by the function F(x | Θ), where x is the input layer and Θ denotes the parametric weights, the RC technique adds the input vector (input layer) x to F(x | Θ) such that:

$$y = F(x \mid \Theta) + x$$

Given the real output y’ from the environment, let l(y) be the likelihood for training the deep neural network F(x):

$$l(y) = l(y \mid y'), \quad y = F(x \mid \Theta) + x$$

The association of the stochastic gradient descent (SGD) algorithm and the backpropagation algorithm is based on calculating the gradient of l(y) with respect to y, which is then propagated back to x:

$$\nabla l(y) = \frac{dl(y)}{dy}\,\frac{dy}{dx} = \frac{dl(y)}{dy}\,\big(1 + F'(x \mid \Theta)\big)$$

where F’(x | Θ) is the gradient / first-order derivative of F(x | Θ) with respect to x.
The expression (1 + F’(x | Θ)) alleviates the vanishing gradient problem because the gradient ∇l(y) does not vanish merely because F’(x | Θ) approaches zero: in that case the factor (1 + F’(x | Θ)) stays near 1. Note, F’(x | Θ) is often non-negative because the derivative of an activation function like the sigmoid function or the rectified linear unit (ReLU) is non-negative.
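The effect of the factor (1 + F’(x | Θ)) can be illustrated numerically by chaining many layers. The sketch below is a toy scalar model, not the training procedure of ResNet: the function name is hypothetical and the per-layer derivative value 0.1 is an arbitrary assumption chosen only to show the trend.

```python
def chained_gradient(local_grads, residual):
    """Multiply per-layer derivatives, optionally adding the identity term 1."""
    g = 1.0
    for d in local_grads:
        g *= (1.0 + d) if residual else d
    return g


# Twenty layers, each with a small (assumed) derivative F'(x | Θ) = 0.1
local_grads = [0.1] * 20

plain = chained_gradient(local_grads, residual=False)  # product of 0.1 terms
skip = chained_gradient(local_grads, residual=True)    # product of 1.1 terms

print(plain, skip)
```

Without the residual term the product shrinks toward zero as the depth grows, whereas with the residual term every factor is at least 1 when the local derivatives are non-negative.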
Attention is a concept that indicates the self-structure of a data entity. It is the starting point of self-supervised learning, which lies between supervised learning and unsupervised learning: the compared object in self-supervised learning is the source entity itself, whereas supervised learning requires an explicit output as the compared object and unsupervised learning does not require a compared object. The self-structure, which is the output feature of self-supervised learning, is often an internal relationship or a hidden meaning; it is generalized as attention in artificial intelligence, and attention determines where a deep neural network focuses. The concept of attention is refined by the transformer (Vaswani, et al., 2017), which is mentioned later and which is also applied to image classification. The attention-based approach in image classification began with the Residual Attention Network (RAN) proposed by Wang et al. (Wang, et al., 2017), who defined attention as the product of the source entity and a mask. Given input data x, let H(x) denote the output of the attention module, which is the composition (element-wise product) of the trunk branch T(x) and the mask branch M(x):

$$H(x) = M(x) * T(x)$$

When residual learning is applied to learning the attention module, the attention output H(x) is modified (Wang, et al., 2017) as follows:

$$H(x) = \big(1 + M(x)\big) * T(x)$$
While the trunk branch T(x) is a normal feed-forward network (FFN), the mask branch M(x) is structured as an encoder-decoder architecture where the encoder, called bottom-up mask branch M1(x), mainly applies max-pooling kernels to extract image features and the decoder, called top-down mask branch M2(x), mainly applies bilinear interpolation to recover the image from such encoded features. The process of the bottom-up mask branch is called down-sampling and the process of the top-down mask branch is called up-sampling. Note, M2(x) connects sequentially with M1(x) to form M(x), where the input of M(x) is the input of M1(x), the output of M1(x) becomes the input of M2(x), and the output of M2(x) is the output of M(x). The product of trunk branch and mask branch that produces attention as H(x) = (1 + M(x)) * T(x) is concretized, given position (row i, column j) and channel c, as follows (Gemini 2025):

$$H_{i,j,c}(x) = \big(1 + M_{i,j,c}(x)\big)\,T_{i,j,c}(x)$$
It is necessary to describe briefly the bilinear interpolation operator, which is the main task of up-sampling in the top-down mask branch M2(x). The main task of bilinear interpolation is to recover the low-resolution feature (the output of M1(x)) so that it is as near as possible to the high-resolution data representing the image from the trunk branch T(x). Let w and h be the width and height of the output of M1(x), which is also the input of M2(x) and is the result of the max-pooling operators of M1(x), and let W and H be the width and height of the output of M2(x), which is the result of bilinear interpolation, so that the width scale $s_w$ and height scale $s_h$ are calculated as follows:

$$s_w = \frac{w}{W}, \quad s_h = \frac{h}{H}$$

Given a target position (i’, j’) on the output of M2(x), a plane of width W and height H, bilinear interpolation estimates the pixel value of the output of M2(x), called target value x(i’, j’), from the pixel values of the input of M2(x), called source values x(i, j), where the source position (i, j) on the plane of width w and height h is determined as follows:

$$i = i'\,s_h, \quad j = j'\,s_w$$

The width fraction $f_w$ and height fraction $f_h$ are determined as follows:

$$f_w = j - \lfloor j \rfloor, \quad f_h = i - \lfloor i \rfloor$$

where the notation $\lfloor \cdot \rfloor$ denotes the lower integer (floor). The four neighbors of the source pixel are determined as follows (Gemini 2025):

$$x_{tl} = x(\lfloor i \rfloor, \lfloor j \rfloor), \quad x_{tr} = x(\lfloor i \rfloor, \lfloor j \rfloor + 1), \quad x_{bl} = x(\lfloor i \rfloor + 1, \lfloor j \rfloor), \quad x_{br} = x(\lfloor i \rfloor + 1, \lfloor j \rfloor + 1)$$

The top interpolation value $v_t$ and the bottom interpolation value $v_b$ are calculated as follows (Gemini 2025):

$$v_t = (1 - f_w)\,x_{tl} + f_w\,x_{tr}, \quad v_b = (1 - f_w)\,x_{bl} + f_w\,x_{br}$$

As a result, the target value x(i’, j’) is estimated by bilinear interpolation as follows (Gemini 2025):

$$x(i', j') = (1 - f_h)\,v_t + f_h\,v_b$$

Note, the target position (i’, j’) ranges over the target plane of width W and height H.
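The bilinear interpolation steps above (scales, source position, fractions, four neighbors, top and bottom values, target value) can be sketched as a small up-sampling routine. This is an illustrative sketch under assumptions: the function name is hypothetical, the scales are taken as source size divided by target size, and neighbor indices are clamped at the border so that they never leave the source plane.

```python
import math


def bilinear_resize(src, H, W):
    """Up-sample an h x w grid `src` to H x W by bilinear interpolation."""
    h, w = len(src), len(src[0])
    sh, sw = h / H, w / W  # assumed convention: source size / target size
    out = []
    for ti in range(H):
        row = []
        for tj in range(W):
            i, j = ti * sh, tj * sw                # source position
            i0, j0 = math.floor(i), math.floor(j)  # top-left neighbor
            i1 = min(i0 + 1, h - 1)                # clamp at bottom border
            j1 = min(j0 + 1, w - 1)                # clamp at right border
            fh, fw = i - i0, j - j0                # height and width fractions
            vt = (1 - fw) * src[i0][j0] + fw * src[i0][j1]  # top value
            vb = (1 - fw) * src[i1][j0] + fw * src[i1][j1]  # bottom value
            row.append((1 - fh) * vt + fh * vb)
        out.append(row)
    return out


src = [[0.0, 2.0],
       [4.0, 6.0]]
up = bilinear_resize(src, 4, 4)
for row in up:
    print(row)
```

Other conventions, such as aligning pixel centers or aligning the corners of the two planes, shift the source positions slightly and give different values near the border.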
The transformer-based approach, whose typical model is ViT developed by Dosovitskiy et al. (Dosovitskiy, et al., 2020), is the most recent approach for image classification. It is described briefly here, noting that the transformer is also developed as a self-supervised learning model in which the attention mechanism is improved beyond the aforementioned product of mask branch and trunk branch. Note, ViT is the abbreviation of vision transformer, which means the application of the transformer to the computer vision domain, such as image processing. The transformer-based approach with ViT differs most from other CNN-based classification methods because the attention of the transformer can replace convolutional layers. The transformer was first applied to natural language processing (NLP), especially machine translation and large language models (LLM), although transfer learning and pre-trained models are also important foundations of LLM. On the other hand, pre-trained models are also applied successfully to the domain of computer vision. Therefore, it is natural to apply the transformer to computer vision, especially image classification. The main problem of transformer-based image classification is that the huge dimension of image data prevents the transformer from computing effectively. There are two solutions to the huge dimension problem (Gemini 2025): 1) the image is divided into patches which are fed to the transformer, and 2) an auxiliary learnable parameter called class token, which is actually a weight vector, is attached to the input of the transformer (the patches) and becomes a digest of the image after it is processed through the transformer, so that this image digest is the input of the classification network (dense network). Because the image digest is much smaller than the image, it helps the transformer maintain effective computation. Note, the image digest is actually the evaluated class token; the class token is randomly initialized only once at the beginning of the training process, it is not initialized again every time image patches are fed to the transformer, and it is updated (learned) through iterations of the backpropagation algorithm like other weight parameters.
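For instance, let A = A(X) be the attention of input data X, which is the most important part of the transformer:

$$A = A(X) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d}}\right)V, \quad Q = XW_Q,\; K = XW_K,\; V = XW_V$$

where Q, K, and V are the query matrix, key matrix, and value matrix, respectively, and d is the number of columns of K.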
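A minimal sketch of this attention computation in plain Python is given below, assuming the widely used scaled dot-product form in which the scores are divided by the square root of the key dimension. The helper names and the tiny input are hypothetical, and identity matrices stand in for the learned weight matrices.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def softmax(row):
    """Numerically stable soft-max of one score row."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(X, WQ, WK, WV):
    """Return the attention output A and the attention weights."""
    Q, K, V = matmul(X, WQ), matmul(X, WK), matmul(X, WV)
    d = len(K[0])                                   # key dimension
    K_t = [list(col) for col in zip(*K)]            # transpose of K
    scores = matmul(Q, K_t)                         # Q K^T
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, V), weights


# Three input rows of size 2; identity matrices replace learned weights
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
I = [[1.0, 0.0], [0.0, 1.0]]
A, P = attention(X, I, I, I)
print(len(A), len(A[0]))
```

Each row of the weight matrix sums to 1, so every row of A is a weighted average of the rows of V.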
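Note that the three weight matrices $W_Q$, $W_K$, and $W_V$ constitute the first part of the parameters of the transformer. The second part is the class token, which is attached to the input data X. Note that X is a matrix whose rows are vectors:

$$X = (x_1, x_2, \ldots, x_n)^T$$

which implies that the class token is a weight vector of the same size as $x_1$. Therefore, without loss of generality, we add the class token $x_0$ right before $x_1$ so that:

$$X = (x_0, x_1, x_2, \ldots, x_n)^T$$

It is possible to re-index the rows and denote:

$$X = (x_1, x_2, \ldots, x_m)^T, \quad m = n + 1$$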
Note that the class token $x_1$ is now a parametric weight vector, whereas x2, x3,…, xm are patches of the image. Therefore, it is necessary to describe briefly how to divide an image into these n = m–1 patches x2, x3,…, xm. Suppose an image has width W and height H and every patch has width w and height h; then there are (W/w) * (H/h) patches, so that n = (W/w) * (H/h). For instance (Gemini 2025), given a 224 x 224 x 3 image and a patch size of 16 x 16 x 3, there are 196 = (224/16) * (224/16) patches, and every 16 x 16 x 3 patch is flattened into a 768 x 1 vector because 768 = 16 * 16 * 3. As a result, the input data X has 197 vectors of 768 elements each, in which the first vector is the class token, whose dimension is also 768 x 1. In other words, X is a 197 x 768 matrix.
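The patch arithmetic can be sketched as follows. The function `patchify` is a hypothetical helper, the image is a dummy array of zeros, and the class token is a zero placeholder standing in for a learnable vector that would be randomly initialized in practice; further processing of the patches is omitted.

```python
def patchify(image, ph, pw):
    """Split an H x W x C nested-list image into flattened ph x pw x C patches."""
    H, W = len(image), len(image[0])
    if H % ph or W % pw:
        raise ValueError("image size must be divisible by patch size")
    patches = []
    for top in range(0, H, ph):
        for left in range(0, W, pw):
            patch = [c
                     for i in range(top, top + ph)
                     for j in range(left, left + pw)
                     for c in image[i][j]]
            patches.append(patch)
    return patches


# Dummy 224 x 224 x 3 image
image = [[[0.0, 0.0, 0.0] for _ in range(224)] for _ in range(224)]
patches = patchify(image, 16, 16)

class_token = [0.0] * 768   # placeholder for the learnable class token
X = [class_token] + patches  # class token is the first row of X

print(len(patches), len(patches[0]), len(X))  # 196 768 197
```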
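In the next section, the backward bias/error b(X), which is the gradient of the likelihood L(A) with respect to the input data X, is determined as the following sum:

$$b(X) = \frac{\partial L(A)}{\partial X} = b(Q)\,W_Q^T + b(K)\,W_K^T + b(V)\,W_V^T$$

where,

$$b(Q) = \frac{\partial L(A)}{\partial Q}, \quad b(K) = \frac{\partial L(A)}{\partial K}, \quad b(V) = \frac{\partial L(A)}{\partial V}$$

It is interesting that the first row of the backward bias b(X) is actually the gradient of the class token $x_1$, and this first row is denoted b(X)[0]. As a result, the parametric class token is iteratively updated like other parameters by stochastic gradient descent as usual:

$$x_1 = x_1 - \gamma\,b(X)[0]$$

where γ (0 < γ ≤ 1) is the learning rate.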