Preprint
Article

This version is not peer-reviewed.

Is Matrix Neural Network the Alternative of Convolutional Neural Network?

Submitted:

01 April 2026

Posted:

02 April 2026


Abstract
Currently (2025), deep learning is the most important and popular methodology in artificial intelligence (AI), and the artificial neural network (ANN) is the foundation of deep learning. The main drawback of ANN is the boom problem: the number of parametric weights becomes huge when a deep network stacks a large number of hidden layers. The boom problem can be alleviated by high-performance computers but becomes serious for high-dimensional input data such as images. The prevailing solution for image processing in deep learning is to reduce the large parametric weight vector to a much smaller window encoded by a so-called filtering kernel, often a 3x3 or 5x5 matrix, which is convolved over the entire image. An ANN supported by such a filtering kernel is called a convolutional neural network (CNN). Many studies show that CNN is feasible and effective in image processing. The underlying reason for the effectiveness of CNN is that the visual structure of an image is aggregated in such a way that a filtering kernel is well suited to extracting image features. However, it has not been established that a matrix-based filtering kernel is appropriate for other high-dimensional data that are not images. Another solution to the boom problem is to organize the large parametric weight vector as matrices with the same structure as 2-dimensional data like images, which leads to the so-called matrix neural network (MNN), whose parameters are weight matrices. The computational cost of MNN is significantly lower than that of ANN, but the effectiveness of MNN relative to CNN needs to be tested. This is the main hypothesis, “whether MNN is the alternative of CNN”, which is tested in this research, as hinted by the research title. Moreover, the transformer is the new trend (2025) in AI and deep learning; it aims to improve or replace the traditional ANN through self-supervised learning, in which attention is the key mechanism. Attention, the cornerstone of the transformer, represents the internal structure and relationships inside high-dimensional data such as images. Therefore, the implicit meanings of attention and the filtering kernel are similar: both represent features of data, and neither goes beyond parametric weights. In general, the research has two goals: 1) explaining and implementing ANN, CNN, and transformer (attention) and 2) applying analysis of variance (ANOVA) to evaluate the effectiveness of ANN, CNN, and transformer (attention) in the context of image classification. The ultimate result is that MNN cannot be asserted to be the alternative of CNN, but MNN can be an optional choice for implementing ANN in image processing instead of relying solely on the CNN solution. Moreover, incorporating MNN and attention when implementing a transformer produces a compromise between high performance and computational cost.

1. Introduction

Machine learning is classified into three branches, namely supervised learning, unsupervised learning, and reinforcement learning, with self-supervised learning as an intermediate between supervised learning and unsupervised learning. Machine learning, which is a sub-domain of artificial intelligence (AI), can be considered the core of AI, and the artificial neural network (ANN) is the preeminent approach within it. Fortunately, ANN fully supports all four: supervised learning, self-supervised learning, unsupervised learning, and reinforcement learning. ANN, which simulates the human neural network, consists of one input layer, sufficiently many hidden layers, and one output layer: information enters the input layer, is propagated through the hidden layers, and is finally evaluated at the output layer, which gives the result of the learning approaches mentioned above, according to a so-called propagation rule. For instance, layer k, specified by the vector variable $x_k = (x_{k1}, x_{k2}, \ldots, x_{kn})^T$, is evaluated from the previous layer $x_{k-1}$ by the propagation rule as follows:
$$x_k = f(\hat{x}_{k-1}) = f(W_k x_{k-1} + \Theta_k)$$
$$\hat{x}_{k-1} = W_k x_{k-1} + \Theta_k$$
Or shortly,
$$x_k = f\left(\hat{x}_{k-1} = W_k x_{k-1} + \Theta_k\right)$$
Note, $W_k$ and $\Theta_k$ are the weight parameter and bias parameter at layer k, where $W_k$ is a weight matrix and $\Theta_k$ is a bias vector as usual. The function f(.) is called the activation function, which is often a squashing function such as the sigmoid function. When the focus is on the particular kth layer, it is denoted $f_{x_k}(\hat{x}_{k-1})$. Training ANN means estimating the parameters $W_k$ and $\Theta_k$; a popular training method combines the stochastic gradient descent (SGD) algorithm with the backpropagation algorithm, based on a predefined likelihood function or a predefined error function. When the number of hidden layers is large enough, ANN is called a deep neural network (DNN), which is the basis of deep learning. The common representation of ANN is the feedforward network (FFN), in which the input layer is fed forward so as to evaluate the output layer; hence, ANN in this research means FFN unless explained otherwise. An FFN with vector layers $x_0, x_1, x_2, \ldots, x_K$ is called a K-layer FFN and is represented as follows:
$$x_k = f\left(\hat{x}_{k-1} = W_k x_{k-1} + \Theta_k\right), \quad k = \overline{1,K}$$
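The propagation rule above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the sigmoid activation, the layer sizes, and the names `forward`, `weights`, and `biases` are hypothetical choices, not taken from the paper.

```python
import numpy as np

def sigmoid(z):
    """Element-wise logistic squashing function, used here as f."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(x0, weights, biases, f=sigmoid):
    """Propagate x0 through K layers: x_k = f(W_k x_{k-1} + Theta_k)."""
    layers = [x0]
    for W, theta in zip(weights, biases):
        x_hat = W @ layers[-1] + theta  # pre-activation \hat{x}_{k-1}
        layers.append(f(x_hat))         # layer output x_k
    return layers

# Hypothetical 2-layer FFN: 4 inputs -> 3 hidden units -> 2 outputs.
rng = np.random.default_rng(0)
weights = [rng.normal(size=(3, 4)), rng.normal(size=(2, 3))]
biases = [np.zeros(3), np.zeros(2)]
layers = forward(np.ones(4), weights, biases)
print([x.shape for x in layers])
```

The function returns every layer, not only the output, because backpropagation needs the intermediate values.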
Classification is the most popular task in supervised learning, and it is implemented effectively by the artificial neural network (ANN). As usual, training an FFN classifier (ANN classifier) aims to minimize the so-called cross-entropy loss function loss(x) given the variable vector $x = (x_1, x_2, \ldots, x_n)^T$ being the output layer $x_K = x$. For instance, given the last output vector $x_K = (x_{K1}, x_{K2}, \ldots, x_{Kn})^T$ computed by ANN and the real class probability $p = (p_1, p_2, \ldots, p_n)^T$ of that vector, the cross-entropy loss function $loss(x_K)$ is specified as follows:
$$loss(x_K) = -\sum_{i=1}^{n} p_i \log \hat{p}_i$$
where,
$$\hat{p}_i = \mathrm{softmax}(x_{Ki}) = \frac{\exp(x_{Ki})}{\sum_{l=1}^{n} \exp(x_{Kl})}$$
So that:
$$W_K = \underset{W_K}{\mathrm{argmin}}\; loss(x_K)$$
$$\Theta_K = \underset{\Theta_K}{\mathrm{argmin}}\; loss(x_K)$$
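As a small numeric illustration of the softmax probabilities and the cross-entropy loss defined above, consider the following sketch. The example vector `x_K` and the one-hot target `p` are made-up values, and subtracting the maximum before exponentiation is a standard numerical-stability step that is not part of the formula itself.

```python
import math

def softmax(x):
    """Estimated probabilities p_hat_i = exp(x_i) / sum_l exp(x_l)."""
    m = max(x)  # shift by the maximum; the result is unchanged
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(p, x):
    """loss(x) = -sum_i p_i * log(p_hat_i); terms with p_i = 0 contribute 0."""
    p_hat = softmax(x)
    return -sum(pi * math.log(qi) for pi, qi in zip(p, p_hat) if pi > 0)

x_K = [2.0, 1.0, 0.1]  # hypothetical output layer
p = [1.0, 0.0, 0.0]    # hypothetical real class probability (one-hot)
p_hat = softmax(x_K)
loss = cross_entropy(p, x_K)
print(p_hat, loss)
```

With a one-hot target the loss reduces to the negative log of the probability assigned to the true class.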
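The parameters $W_K$ and $\Theta_K$ at the last layer K are estimated by the stochastic gradient descent (SGD) algorithm as follows:
$$W_K = W_K + \gamma \nabla_{W_K} l$$
$$\Theta_K = \Theta_K + \gamma \nabla_{\Theta_K} l$$
where γ (0 < γ ≤ 1) is the learning rate. Note, $\nabla_{W_K} l$ and $\nabla_{\Theta_K} l$ are the gradients of the likelihood $l = -loss(x_K)$ with respect to $W_K$ and $\Theta_K$, respectively, so adding them decreases the loss. They are determined from the cross-entropy gradient ∇loss(x), which is calculated as the following row vector:
$$\nabla loss(x) = \left(\frac{\partial loss(x)}{\partial x_1}, \frac{\partial loss(x)}{\partial x_2}, \ldots, \frac{\partial loss(x)}{\partial x_n}\right) = \left(\frac{\partial loss(x)}{\partial x_j}\right)_{j=\overline{1,n}}$$
where $\frac{\partial loss(x)}{\partial x_j}$ is the partial derivative of loss(x) with respect to $x_j$ as follows:
$$\frac{\partial loss(x)}{\partial x_j} = -\sum_{i=1}^{n} p_i \frac{\partial \log \hat{p}_i}{\partial x_j} = -\sum_{i=1}^{n} \frac{p_i}{\hat{p}_i} \frac{d\hat{p}_i}{dx_j}$$
where $\frac{d\hat{p}_i}{dx_j}$ is the derivative of the estimated probability $\hat{p}_i$ with respect to $x_j$ as follows:
$$\frac{d\hat{p}_i}{dx_j} = \frac{d}{dx_j}\left(\frac{\exp(x_i)}{\sum_{l=1}^{n}\exp(x_l)}\right) = \frac{1}{\left(\sum_{l=1}^{n}\exp(x_l)\right)^2}\left(\frac{d\exp(x_i)}{dx_j}\sum_{l=1}^{n}\exp(x_l) - \exp(x_i)\sum_{l=1}^{n}\frac{d\exp(x_l)}{dx_j}\right)$$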
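If i = j, we have:
$$\frac{d\hat{p}_i}{dx_j} = \frac{d\hat{p}_i}{dx_i} = \frac{\exp(x_i)}{\sum_{l=1}^{n}\exp(x_l)} - \left(\frac{\exp(x_i)}{\sum_{l=1}^{n}\exp(x_l)}\right)^2 = \hat{p}_i\left(1 - \hat{p}_i\right)$$
If i ≠ j, we have:
$$\frac{d\hat{p}_i}{dx_j} = -\frac{\exp(x_i)}{\sum_{l=1}^{n}\exp(x_l)} \cdot \frac{\exp(x_j)}{\sum_{l=1}^{n}\exp(x_l)} = \hat{p}_i\left(0 - \hat{p}_j\right)$$
This implies that the derivative $\frac{d\hat{p}_i}{dx_j}$ of the estimated probability $\hat{p}_i$ with respect to $x_j$ is specified as follows:
$$\frac{d\hat{p}_i}{dx_j} = \hat{p}_i\left(\delta_{ij} - \hat{p}_j\right)$$
where,
$$\delta_{ij} = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{if } i \neq j \end{cases}$$
Therefore, the partial derivative of loss(x) with respect to $x_j$ is determined as follows:
$$\frac{\partial loss(x)}{\partial x_j} = -\sum_{i=1}^{n} \frac{p_i}{\hat{p}_i}\hat{p}_i\left(\delta_{ij} - \hat{p}_j\right) = -\sum_{i=1}^{n} p_i\left(\delta_{ij} - \hat{p}_j\right)$$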
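The closed form of the partial derivative can be checked numerically. The sketch below compares $-\sum_i p_i(\delta_{ij} - \hat{p}_j)$, taken for the cross-entropy loss $-\sum_i p_i \log \hat{p}_i$, with central finite differences. The test vectors and the step size `h` are arbitrary assumptions chosen for illustration.

```python
import math

def softmax(x):
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

def loss(p, x):
    """Cross-entropy loss -sum_i p_i log(p_hat_i)."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, softmax(x)))

def analytic_grad(p, x):
    """d loss / d x_j = -sum_i p_i (delta_ij - p_hat_j)."""
    p_hat = softmax(x)
    n = len(x)
    return [-sum(p[i] * ((1.0 if i == j else 0.0) - p_hat[j]) for i in range(n))
            for j in range(n)]

def numeric_grad(p, x, h=1e-6):
    """Central finite differences, one coordinate at a time."""
    grad = []
    for j in range(len(x)):
        xp, xm = list(x), list(x)
        xp[j] += h
        xm[j] -= h
        grad.append((loss(p, xp) - loss(p, xm)) / (2 * h))
    return grad

x = [0.5, -1.0, 2.0]  # arbitrary test point
p = [0.2, 0.3, 0.5]   # arbitrary probability vector summing to 1
max_diff = max(abs(a - b) for a, b in zip(analytic_grad(p, x), numeric_grad(p, x)))
print(max_diff)
```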
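As a result, the cross-entropy gradient ∇loss(x) is totally determined as follows:
$$\nabla loss(x) = \left(-\sum_{i=1}^{n} p_i\left(\delta_{ij} - \hat{p}_j\right)\right)_{j=\overline{1,n}} = -\begin{pmatrix} \sum_{i=1}^{n} p_i\left(\delta_{i1} - \hat{p}_1\right) \\ \sum_{i=1}^{n} p_i\left(\delta_{i2} - \hat{p}_2\right) \\ \vdots \\ \sum_{i=1}^{n} p_i\left(\delta_{in} - \hat{p}_n\right) \end{pmatrix}^T$$
where,
$$\delta_{ij} = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{if } i \neq j \end{cases}, \qquad \hat{p}_j = \mathrm{softmax}(x_j) = \frac{\exp(x_j)}{\sum_{l=1}^{n}\exp(x_l)}$$
So that:
$$\nabla_{W_K} l = \nabla_{\Theta_K} l \; x_{K-1}^T$$
$$\nabla_{\Theta_K} l = -f'_{x_K}(\hat{x}_{K-1})^T \, \nabla loss(x)^T = f'_{x_K}(\hat{x}_{K-1})^T \left(\sum_{i=1}^{n} p_i\left(\delta_{ij} - \hat{p}_j\right)\right)^T_{j=\overline{1,n}}$$
Note, the superscript “T” denotes the transposition operator of vectors and matrices. By association of SGD and the backpropagation algorithm, all gradients $\nabla_{W_k} l$ and $\nabla_{\Theta_k} l$ of all layers from k=1 to k=K are determined as follows:
$$\nabla_{W_k} l = \nabla_{\Theta_k} l \; x_{k-1}^T, \quad k = \overline{1,K}$$
$$\nabla_{\Theta_K} l = f'_{x_K}(\hat{x}_{K-1})^T \left(\sum_{i=1}^{n} p_i\left(\delta_{ij} - \hat{p}_j\right)\right)^T_{j=\overline{1,n}}$$
$$\nabla_{\Theta_k} l = f'_{x_k}(\hat{x}_{k-1})^T \, W_{k+1}^T \, \nabla_{\Theta_{k+1}} l \quad \text{if } k = \overline{1,K-1}$$
So that the ANN classifier is totally trained by estimating its parameters $W_k$ and $\Theta_k$ from k=1 to k=K as follows:
$$W_k = W_k + \gamma \nabla_{W_k} l$$
$$\Theta_k = \Theta_k + \gamma \nabla_{\Theta_k} l$$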
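To make the update rule concrete, here is a minimal sketch of repeated SGD steps on a single output layer. It assumes a sigmoid activation f followed by softmax, so that the bias gradient of the likelihood is $f'(\hat{x}) \odot (p - \hat{p})$ and the weight gradient is its outer product with the previous layer. The input, target, learning rate, and the helper name `sgd_step` are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def sgd_step(W, theta, x_prev, p, gamma=0.5):
    """One update W += gamma * grad_W l, Theta += gamma * grad_Theta l."""
    x_hat = W @ x_prev + theta            # pre-activation of the last layer
    x_K = sigmoid(x_hat)                  # output layer
    p_hat = softmax(x_K)                  # estimated class probabilities
    loss = -np.sum(p * np.log(p_hat))     # cross-entropy before the update
    f_prime = x_K * (1.0 - x_K)           # sigmoid derivative, element-wise
    grad_theta = f_prime * (p - p_hat)    # gradient of likelihood l = -loss
    grad_W = np.outer(grad_theta, x_prev) # grad_Theta times x_{K-1}^T
    return W + gamma * grad_W, theta + gamma * grad_theta, loss

x_prev = np.array([1.0, 0.5, -0.5])  # hypothetical previous layer x_{K-1}
p = np.array([1.0, 0.0])             # hypothetical real class probability
W, theta = np.zeros((2, 3)), np.zeros(2)
losses = []
for _ in range(200):
    W, theta, loss = sgd_step(W, theta, x_prev, p)
    losses.append(loss)
print(losses[0], losses[-1])
```

Because the update adds the gradient of the likelihood, the recorded cross-entropy loss shrinks over the iterations.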
Data classification becomes extremely hard when the data are images, which are 2-dimensional data. Image classification raises two issues: 1) the contextual meaning of an image is difficult to capture and 2) a high-resolution image causes the so-called boom problem of a huge number of parameters, so that training ANN consumes a lot of time and resources. The convolutional neural network (CNN) was invented to solve this boom problem: an image is passed through a so-called filter to extract its features, and the features are then fed as input to an ANN for classification. Feature extraction has two goals: 1) the dimension of the image data space is reduced significantly by the CNN filter, which makes training the ANN classifier feasible, and 2) features, which are essential aspects of an image such as edges and patterns, make image classification more accurate. In particular, the parameter size in CNN, which is the size of the kernel matrix, is much smaller than the size of the weight matrix in ANN, which significantly decreases computational cost. In general, CNN is an excellent approach to image classification, so it is necessary to sketch its methodology in order to compare the matrix neural network (Gao, Guo, & Wang, 2016) with CNN. Essentially, CNN has two connected parts: 1) the first part, called the convolutional network, consists of a sequence of convolutional layers, each of which is filtered by a kxk filter, and 2) the second part, called the fully connected network or dense network, takes the output of the convolutional network, which is essentially the image feature, as its input and performs the classification task on it. This classification task has just been described, but note that the main point of CNN is the convolutional network (not the dense network).
In short, the input and output of the convolutional network are the image and the image feature, respectively, whereas the input and output of the dense network are the image feature and the image class, respectively. The dense network is a feedforward network (FFN) as usual, whereas each convolutional layer is a 2-dimensional data/feature matrix produced by a filter. Given an mxn image as an mxn matrix and a kxk filter, the corresponding convolutional layer is reduced to an (m/k)x(n/k) matrix after filtering when the filter moves in non-overlapping steps of k pixels, which means that the image size is decreased $k^2$ times by a kxk filter. For instance, given an image as a 2-dimensional mxn matrix denoted $X = (x_{ij})_{m \times n}$, a filter, also called kernel or filtering kernel, is a kxk square matrix denoted $W = (w_{ij})_{k \times k}$, which is often a 3x3 or 5x5 matrix. Given pixel $X[i][j] = x_{ij} = x_{i,j}$ at the ith row and jth column of matrix X, the respective value $y_{ij}$, which is a cell of the convolutional layer Y and is called a convolutional value or convolutional cell, is calculated by a so-called filter operator, which is the sum of products of neighboring pixels and the entries of $W = (w_{ij})_{k \times k}$.
$$y_{ij} = \sum_{u,v} w_{uv}\, x_{i+u,j+v} + \theta = \sum_{u=1}^{k}\sum_{v=1}^{k} w_{uv}\, x_{i+u,j+v} + \theta$$
Both W and θ are parameters of the convolutional layer, where W is called the filter weight matrix and θ is called the filter bias. An activation function f(.) can be applied to the filter operator as in ANN:
$$y_{ij} = f\left(\hat{x}_{ij} = \sum_{u,v} w_{uv}\, x_{i+u,j+v} + \theta\right)$$
The common activation function for CNN is the Rectified Linear Unit (ReLU); the bounded form used here limits its input to the interval [a, b], and the standard ReLU corresponds to a = 0 and b = +∞.
$$\mathrm{ReLU}(x) = \max\left(a, \min\left(b, x\right)\right)$$
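The filter operator and the bounded ReLU can be illustrated with a naive NumPy loop. This sketch makes several assumptions that are not in the text: indices are 0-based (the formula uses 1-based u and v), the filter is applied only where it fits entirely inside the image, and the function names are hypothetical.

```python
import numpy as np

def bounded_relu(x, a=0.0, b=np.inf):
    """ReLU(x) = max(a, min(b, x)); a=0, b=inf gives the standard ReLU."""
    return np.maximum(a, np.minimum(b, x))

def filter_operator(X, W, theta=0.0, stride=1, f=bounded_relu):
    """y_ij = f(sum_uv w_uv * x_{i+u, j+v} + theta) over all valid positions."""
    k = W.shape[0]
    rows = (X.shape[0] - k) // stride + 1
    cols = (X.shape[1] - k) // stride + 1
    Y = np.empty((rows, cols))
    for i in range(rows):
        for j in range(cols):
            window = X[i * stride:i * stride + k, j * stride:j * stride + k]
            Y[i, j] = np.sum(W * window) + theta
    return f(Y)

X = np.arange(16, dtype=float).reshape(4, 4)  # toy 4x4 "image"
W = np.full((3, 3), 1.0 / 9.0)                # 3x3 averaging kernel
Y = filter_operator(X, W)
print(Y)  # each cell is the mean of one 3x3 window
```

An averaging kernel is used only because its output is easy to verify by hand; a trained filter would hold learned weights instead.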
Indeed, the filter W is slid over the entire image X so as to complete the convolutional layer Y; the filter moves step by step, and the step length is called the stride, denoted s. As usual, the stride s is in the range [1, k] such that 1 ≤ s ≤ k, where k is often called the filter size of the filter $W = (w_{ij})_{k \times k}$. If image X is an nxn matrix and filter W is a kxk matrix with stride s, then the size o of the oxo convolutional layer Y is:
$$o = 1 + \frac{n - k}{s}$$
When the filtering kernel W moves toward the edge of the image, some residual pixels do not fit W completely, so the image border is extended with extra pixels called padding pixels. As usual, padding pixels are filled with zero, which is the so-called zero-padding technique. Let p be the number of padding columns on each side; the size o of the convolutional layer is then modified slightly as follows:
$$o = 1 + \frac{n - k + 2p}{s}$$
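A short helper makes the size formula easy to try on concrete numbers. The use of integer (floor) division when n − k + 2p is not a multiple of s is an assumption of this sketch; the formula in the text does not say how a fractional result is handled.

```python
def conv_output_size(n, k, s=1, p=0):
    """Size o = 1 + (n - k + 2p) / s of the convolutional layer.

    Assumption: floor division is used when the division is not exact.
    """
    if not 1 <= s <= k:
        raise ValueError("stride s must satisfy 1 <= s <= k")
    return 1 + (n - k + 2 * p) // s

print(conv_output_size(28, 3))        # 26: no padding, stride 1
print(conv_output_size(28, 3, p=1))   # 28: one padding column keeps the size
print(conv_output_size(6, 3, s=3))    # 2: stride k shrinks n to n/k
```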
The effectiveness of a CNN depends on how the sequence of convolutional layers and their filters is specified, and many studies define filters tailored to extract as many (image) features as possible. However, this research focuses on general filters and how to train them, in order to generalize CNN for comparison with the matrix neural network described later. Therefore, it is necessary to skim over convolutional filter training (Gemini 2025). Consider a K-layer CNN whose layers are $X_0, X_1, X_2, \ldots, X_K$, where each convolutional layer $X_k$ has a parametric filter weight matrix $W^{(k)}$ and a parametric filter bias $\theta^{(k)}$.
$$x_{ij}^{(k)} = f\left(\hat{x}_{ij}^{(k-1)} = \sum_{u,v} w_{uv}^{(k)}\, x_{i+u,j+v}^{(k-1)} + \theta^{(k)}\right)$$
Note,
$$x_{ij}^{(k)} \in X_k, \qquad x_{i+u,j+v}^{(k-1)} \in X_{k-1}, \qquad w_{uv}^{(k)} \in W^{(k)}$$
Let $l(x_{ij}^{(K)})$ be the likelihood function which takes $x_{ij}^{(K)}$ as its input at the last layer $X_K$; as usual, $l(x_{ij}^{(K)})$ is the negative of the error function:
$$l\left(x_{ij}^{(K)}\right) = -\varepsilon\left(x_{ij}^{(K)}\right)$$
Let $l(X_K)$ be the entire likelihood function, which is the mean of $l(x_{ij}^{(K)})$ over all points $x_{ij}^{(K)}$ belonging to $X_K$.
$$l(X_K) = \frac{1}{|X_K|}\sum_{i,j} l\left(x_{ij}^{(K)}\right) = \frac{1}{|X_K|}\sum_{i}\sum_{j} l\left(x_{ij}^{(K)}\right)$$
Or,
$$l(X_K) = \sum_{i,j} l\left(x_{ij}^{(K)}\right) = \sum_{i}\sum_{j} l\left(x_{ij}^{(K)}\right)$$
where the notation $|X_K|$ denotes the size of layer $X_K$; for instance, if $X_K$ is an mxn matrix, then its size $|X_K|$ is m*n. Training CNN by association of the stochastic gradient descent (SGD) algorithm and the backpropagation algorithm means estimating the parameters $W^{(k)}$ and $\theta^{(k)}$ of all convolutional layers. The parameters $W^{(K)}$ and $\theta^{(K)}$ of the last layer $X_K$ are estimated first:
$$W^{(K)} = \underset{W^{(K)}}{\mathrm{argmax}}\; l(X_K), \qquad \theta^{(K)} = \underset{\theta^{(K)}}{\mathrm{argmax}}\; l(X_K)$$
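According to SGD, the parametric filter $w_{uv}^{(K)}$ and the parametric bias $\theta^{(K)}$ are estimated iteratively by adding to each of them the gradient of the likelihood function $l(X_K)$ with respect to $w_{uv}^{(K)}$ and $\theta^{(K)}$, respectively, scaled by the learning rate γ. As a result, training CNN essentially means calculating these gradients. For the last layer $X_K$, we have:
$$w_{uv}^{(K)} = w_{uv}^{(K)} + \gamma \nabla_{w_{uv}^{(K)}} l(X_K), \qquad \theta^{(K)} = \theta^{(K)} + \gamma \nabla_{\theta^{(K)}} l(X_K)$$
where,
$$\nabla_{w_{uv}^{(K)}} l = \nabla_{w_{uv}^{(K)}} l(X_K) = \sum_{i:u,\, j:v} \nabla_{w_{uv}^{(K)}} l\left(x_{ij}^{(K)}\right)$$
$$\nabla_{\theta^{(K)}} l = \nabla_{\theta^{(K)}} l(X_K) = \sum_{i:u,\, j:v} \nabla_{\theta^{(K)}} l\left(x_{ij}^{(K)}\right)$$
Note that the notation “i:u, j:v” means that indices i and j are iterated, for the given indices u and v, over every sum $\hat{x}_{ij}^{(K-1)} = \sum_{u',v'} w_{u'v'}^{(K)}\, x_{i+u',j+v'}^{(K-1)}$ in which the pair (u’, v’) = (u, v) occurs exactly once. Let $\nabla_{w_{uv}^{(K)}} l(x_{ij}^{(K)})$ and $\nabla_{\theta^{(K)}} l(x_{ij}^{(K)})$ be the gradients of the likelihood function $l(x_{ij}^{(K)})$ with respect to $w_{uv}^{(K)}$ and $\theta^{(K)}$, respectively, as follows (Gemini 2025):
$$\nabla_{w_{uv}^{(K)}} l\left(x_{ij}^{(K)}\right) = \frac{dl\left(x_{ij}^{(K)}\right)}{dw_{uv}^{(K)}} = \frac{dl\left(x_{ij}^{(K)}\right)}{dx_{ij}^{(K)}} \frac{dx_{ij}^{(K)}}{d\hat{x}_{ij}^{(K-1)}} \frac{\partial \hat{x}_{ij}^{(K-1)}}{\partial w_{uv}^{(K)}} = l'\left(x_{ij}^{(K)}\right) f'\left(\hat{x}_{ij}^{(K-1)}\right) x_{i+u,j+v}^{(K-1)}$$
$$\nabla_{\theta^{(K)}} l\left(x_{ij}^{(K)}\right) = \frac{dl\left(x_{ij}^{(K)}\right)}{d\theta^{(K)}} = \frac{dl\left(x_{ij}^{(K)}\right)}{d\hat{x}_{ij}^{(K-1)}} = \frac{dl\left(x_{ij}^{(K)}\right)}{dx_{ij}^{(K)}} \frac{dx_{ij}^{(K)}}{d\hat{x}_{ij}^{(K-1)}} \frac{\partial \hat{x}_{ij}^{(K-1)}}{\partial \theta^{(K)}} = l'\left(x_{ij}^{(K)}\right) f'\left(\hat{x}_{ij}^{(K-1)}\right)$$
We obtain the equations to learn the parametric filter $w_{uv}^{(K)}$ and the parametric bias $\theta^{(K)}$ at the output layer $X_K$.
$$\nabla_{w_{uv}^{(K)}} l = \sum_{i:u,\, j:v} x_{i+u,j+v}^{(K-1)}\, \nabla_{\theta^{(K)}} l\left(x_{ij}^{(K)}\right)$$
$$\nabla_{\theta^{(K)}} l = \sum_{i,j} \nabla_{\theta^{(K)}} l\left(x_{ij}^{(K)}\right)$$
$$\nabla_{\theta^{(K)}} l\left(x_{ij}^{(K)}\right) = f'\left(\hat{x}_{ij}^{(K-1)}\right) l'\left(x_{ij}^{(K)}\right)$$
By generalization, the gradients for the parametric filter $w_{uv}^{(k)}$ and the parametric bias $\theta^{(k)}$ at any layer $X_k$ where 1 ≤ k ≤ K are totally determined as follows:
$$\nabla_{w_{uv}^{(k)}} l = \sum_{i:u,\, j:v} x_{i+u,j+v}^{(k-1)}\, \nabla_{\theta^{(k)}} l\left(x_{ij}^{(k)}\right), \qquad \nabla_{\theta^{(k)}} l = \sum_{i,j} \nabla_{\theta^{(k)}} l\left(x_{ij}^{(k)}\right)$$
The quantity $\nabla_{\theta^{(k)}} l(x_{ij}^{(k)})$ is also called the elemental parametric error at any layer $X_k$ where 1 ≤ k ≤ K. At the last layer $X_K$ it is already determined as:
$$\nabla_{\theta^{(K)}} l\left(x_{ij}^{(K)}\right) = f'\left(\hat{x}_{ij}^{(K-1)}\right) l'\left(x_{ij}^{(K)}\right)$$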
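It is necessary to generalize it to any layer $X_k$ where 1 ≤ k ≤ K. Indeed, this elemental parametric error is propagated backward from layer $X_K$ to layer $X_{K-1}$, so let $\nabla_{x_{i+u,j+v}^{(K-1)}} l(x_{ij}^{(K-1)})$ be the error of every element at layer $X_{K-1}$ (Gemini 2025):
$$\nabla_{x_{i+u,j+v}^{(K-1)}} l\left(x_{ij}^{(K-1)}\right) = \frac{dl\left(x_{ij}^{(K)}\right)}{dx_{i+u,j+v}^{(K-1)}} = \frac{dl\left(x_{ij}^{(K)}\right)}{dx_{ij}^{(K)}} \frac{dx_{ij}^{(K)}}{d\hat{x}_{ij}^{(K-1)}} \frac{\partial \hat{x}_{ij}^{(K-1)}}{\partial x_{i+u,j+v}^{(K-1)}} = l'\left(x_{ij}^{(K)}\right) f'\left(\hat{x}_{ij}^{(K-1)}\right) w_{uv}^{(K)} = \nabla_{\theta^{(K)}} l\left(x_{ij}^{(K)}\right) w_{uv}^{(K)}$$
In practice, the error of every element $x_{ij}$ at layer $X_{K-1}$ is the sum of $\nabla_{x_{i+u,j+v}^{(K-1)}} l(x_{ij}^{(K-1)})$ over the kernel $W^{(K)}$ as follows:
$$\nabla_{x_{ij}^{(K-1)}} l\left(x_{ij}^{(K-1)}\right) \triangleq \sum_{i':i,\, j':j} \nabla_{\theta^{(K)}} l\left(x_{i'j'}^{(K)}\right) w_{u'v'}^{(K)}$$
Note that the notation “i’:i, j’:j” means that indices i’ and j’ are iterated, for the given indices i and j, over every sum $\hat{x}_{i'j'}^{(K-1)} = \sum_{u',v'} w_{u'v'}^{(K)}\, x_{i'+u',j'+v'}^{(K-1)}$ in which there is exactly one pair (u’, v’) such that i = i’+u’ and j = j’+v’. The elemental parametric error is generalized to any layer $X_k$ where 1 ≤ k ≤ K as follows:
$$\nabla_{x_{ij}^{(k)}} l\left(x_{ij}^{(k)}\right) \triangleq \begin{cases} l'\left(x_{ij}^{(K)}\right) & \text{if } k = K \\ \sum_{i':i,\, j':j} w_{u'v'}^{(k+1)}\, \nabla_{\theta^{(k+1)}} l\left(x_{i'j'}^{(k+1)}\right) & \text{if } k < K \end{cases}$$
In general, the gradients necessary for training CNN are summarized from k=K down to k=1 as follows:
$$\nabla_{w_{uv}^{(k)}} l = \sum_{i:u,\, j:v} x_{i+u,j+v}^{(k-1)}\, \nabla_{\theta^{(k)}} l\left(x_{ij}^{(k)}\right)$$
$$\nabla_{\theta^{(k)}} l = \sum_{i,j} \nabla_{\theta^{(k)}} l\left(x_{ij}^{(k)}\right)$$
$$\nabla_{\theta^{(k)}} l\left(x_{ij}^{(k)}\right) = f'\left(\hat{x}_{ij}^{(k-1)}\right) \nabla_{x_{ij}^{(k)}} l\left(x_{ij}^{(k)}\right)$$
$$\nabla_{x_{ij}^{(k)}} l\left(x_{ij}^{(k)}\right) \triangleq \begin{cases} l'\left(x_{ij}^{(K)}\right) & \text{if } k = K \\ \sum_{i':i,\, j':j} w_{u'v'}^{(k+1)}\, \nabla_{\theta^{(k+1)}} l\left(x_{i'j'}^{(k+1)}\right) & \text{if } k < K \end{cases}$$
Such that:
$$w_{uv}^{(k)} = w_{uv}^{(k)} + \gamma \nabla_{w_{uv}^{(k)}} l, \qquad \theta^{(k)} = \theta^{(k)} + \gamma \nabla_{\theta^{(k)}} l$$
where γ (0 < γ ≤ 1) is the learning rate.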
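The filter and bias gradients can be checked numerically for a single convolutional layer. The sketch below assumes a sigmoid activation, stride 1 without padding, 0-based indices, and a squared-error likelihood $l(x_{ij}) = -\frac{1}{2}(t_{ij} - x_{ij})^2$ summed over the layer, so that $l'(x_{ij}) = t_{ij} - x_{ij}$. The target matrix `T` and all function names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def conv_forward(X, W, theta):
    """x_ij = f(sum_uv w_uv * X[i+u, j+v] + theta), stride 1, no padding."""
    k = W.shape[0]
    rows, cols = X.shape[0] - k + 1, X.shape[1] - k + 1
    X_hat = np.empty((rows, cols))
    for i in range(rows):
        for j in range(cols):
            X_hat[i, j] = np.sum(W * X[i:i + k, j:j + k]) + theta
    return sigmoid(X_hat)

def likelihood(X, W, theta, T):
    """Summed likelihood: negative half squared error against target T."""
    return -0.5 * np.sum((T - conv_forward(X, W, theta)) ** 2)

def conv_gradients(X, W, theta, T):
    k = W.shape[0]
    out = conv_forward(X, W, theta)
    delta = out * (1.0 - out) * (T - out)  # elemental error f'(x_hat) * l'(x)
    rows, cols = delta.shape
    grad_W = np.empty_like(W)
    for u in range(k):
        for v in range(k):
            # sum over i, j of X[i+u, j+v] * delta[i, j]
            grad_W[u, v] = np.sum(X[u:u + rows, v:v + cols] * delta)
    return grad_W, delta.sum()

rng = np.random.default_rng(0)
X, W, theta = rng.normal(size=(5, 5)), rng.normal(size=(3, 3)), 0.1
T = rng.uniform(size=(3, 3))
grad_W, grad_theta = conv_gradients(X, W, theta, T)

h = 1e-6  # finite-difference step
num_W = np.zeros_like(W)
for u in range(3):
    for v in range(3):
        E = np.zeros_like(W)
        E[u, v] = h
        num_W[u, v] = (likelihood(X, W + E, theta, T)
                       - likelihood(X, W - E, theta, T)) / (2 * h)
num_theta = (likelihood(X, W, theta + h, T)
             - likelihood(X, W, theta - h, T)) / (2 * h)
print(np.max(np.abs(grad_W - num_W)), abs(grad_theta - num_theta))
```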
There are special filters without parameters called pooling filters, among which the max-pooling filter is a popular one. Although the max-pooling filter simply finds the maximum value within a uxv window, it is effective at extracting the image feature as a set of maximum-value pixels.
$$x_{ij}^{(k)} = \max_{u,v}\, x_{i+u,j+v}^{(k-1)}$$
For the max-pooling filter it is only necessary to calculate the elemental parametric error, so that the backpropagation algorithm merely sends this error backward, as follows:
$$\nabla_{x_{ij}^{(k)}} l\left(x_{ij}^{(k)}\right) \triangleq \begin{cases} l'\left(x_{ij}^{(K)}\right) & \text{if } k = K \\ \sum_{i',j'} \begin{cases} \nabla_{x_{i'j'}^{(k+1)}} l\left(x_{i'j'}^{(k+1)}\right) & \text{if } x_{ij}^{(k)} \text{ is } x_{i'j'}^{(k+1)} \\ 0 & \text{otherwise} \end{cases} & \text{if } k < K \end{cases}$$
Note that the indices i’ and j’ are iterated over the positions where the max-pooling operator $x_{i'j'}^{(k+1)} = \max_{u,v}\, x_{i'+u,j'+v}^{(k)}$ is satisfied.
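The forward and backward passes of max pooling can be sketched as follows. The sketch assumes non-overlapping 2x2 windows (stride equal to the window size) and 0-based indices, and it records the position of each maximum so that the backward pass can route the error to exactly that pixel. The function names and the toy matrices are hypothetical.

```python
import numpy as np

def max_pool_forward(X, size=2, stride=2):
    """Return the pooled layer and, per output cell, the winning input index."""
    rows = (X.shape[0] - size) // stride + 1
    cols = (X.shape[1] - size) // stride + 1
    Y = np.empty((rows, cols))
    argmax = {}
    for i in range(rows):
        for j in range(cols):
            window = X[i * stride:i * stride + size, j * stride:j * stride + size]
            u, v = np.unravel_index(np.argmax(window), window.shape)
            Y[i, j] = window[u, v]
            argmax[(i, j)] = (i * stride + u, j * stride + v)
    return Y, argmax

def max_pool_backward(err_out, argmax, input_shape):
    """Send each output error back to the pixel that produced the maximum."""
    err_in = np.zeros(input_shape)
    for (i, j), (r, c) in argmax.items():
        err_in[r, c] += err_out[i, j]  # all other pixels receive 0
    return err_in

X = np.array([[1, 3, 2, 0],
              [4, 2, 1, 5],
              [0, 1, 7, 2],
              [3, 6, 2, 8]], dtype=float)
Y, argmax = max_pool_forward(X)
err_in = max_pool_backward(np.array([[1.0, 2.0], [3.0, 4.0]]), argmax, X.shape)
print(Y)
print(err_in)
```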
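CNN is extended with multi-channel layers, in which each layer $X_k$ has C channels and the filter weight matrix has C channels too, while the parametric filter bias is kept intact, so that the propagation is extended slightly as follows:
$$x_{ijc}^{(k)} = f\left(\hat{x}_{ijc}^{(k-1)} = \sum_{c=1}^{C}\sum_{u,v} w_{uvc}^{(k)}\, x_{i+u,j+v,c}^{(k-1)} + \theta^{(k)}\right)$$
where the pixel values $x_{ijc}^{(k)}$ and the weights $w_{uvc}^{(k)}$ are additionally indexed by channel c. The likelihood function $l(x_{ij}^{(K)})$ is extended as follows:
$$l\left(x_{ij}^{(K)}\right) = \sum_{c=1}^{C} l\left(x_{ijc}^{(K)}\right)$$
where,
$$l\left(x_{ijc}^{(K)}\right) = -\varepsilon\left(x_{ijc}^{(K)}\right)$$
The process of training CNN is not changed, but its estimation equations change slightly, as follows:
$$\nabla_{w_{uvc}^{(k)}} l = \sum_{i:u,\, j:v} x_{i+u,j+v,c}^{(k-1)}\, \nabla_{\theta^{(k)}} l\left(x_{ijc}^{(k)}\right)$$
$$\nabla_{\theta^{(k)}} l = \sum_{i,j,c} \nabla_{\theta^{(k)}} l\left(x_{ijc}^{(k)}\right)$$
$$\nabla_{\theta^{(k)}} l\left(x_{ijc}^{(k)}\right) = f'\left(\hat{x}_{ijc}^{(k-1)}\right) \nabla_{x_{ijc}^{(k)}} l\left(x_{ijc}^{(k)}\right)$$
$$\nabla_{x_{ijc}^{(k)}} l\left(x_{ijc}^{(k)}\right) \triangleq \begin{cases} l'\left(x_{ijc}^{(K)}\right) & \text{if } k = K \\ \sum_{u:i,\, v:j} w_{uvc}^{(k+1)}\, \nabla_{\theta^{(k+1)}} l\left(x_{ijc}^{(k+1)}\right) & \text{if } k < K \end{cases}$$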
The core last convolutional error $\nabla_{x_{ij}^{(K)}} l(x_{ij}^{(K)}) = l'(x_{ij}^{(K)})$ at the output layer $X_K$ is propagated backward from the dense network. Let $\nabla_{X_K} l(X_K)$ denote the core last convolutional error (matrix) including all $\nabla_{x_{ij}^{(K)}} l(x_{ij}^{(K)})$ at the last convolutional layer $X_K$. Suppose the dense network has N layers $y_0, y_1, y_2, \ldots, y_N$; the last convolutional error $\nabla_{X_K} l(X_K)$ is obtained as the bias of the dense network at its first layer $y_0$, propagated from the core last bias $b_{y_N}(y_N)$ as follows:
$$\nabla_{X_K} l\left(\mathrm{vec}(X_K)\right) = \left(\prod_{n=1}^{N-1} g'_{y_n}\left(\hat{y}_{n-1}\right)^T W_{n+1}^T\right) g'_{y_N}\left(\hat{y}_{N-1}\right) b_{y_N}(y_N)$$
where vec(.) represents the vectorization technique, which flattens a matrix into a column vector, g(.) is the activation function of the dense network, and $W_n$ is the parametric weight matrix of dense layer n. If the dense network is trained by minimizing the squared norm function, the core last bias $b_{y_N}(y_N)$ is easily determined as follows:
$$b_{y_N}(y_N) = y_N' - y_N$$
where $y_N'$ is the real output of the dense network from the environment. Note that, while $\nabla_{x_{ij}^{(K)}} l(x_{ij}^{(K)}) = l'(x_{ij}^{(K)})$ is the core last convolutional error, the gradient $\nabla_{\theta^{(k)}} l(x_{ij}^{(k)})$ in general is called the convolutional error, backward error, or backward bias at the kth convolutional layer. It is the cornerstone of the association of the backpropagation algorithm and the stochastic gradient descent (SGD) algorithm, because the gradient $\nabla_{\theta^{(k)}} l(x_{ij}^{(k)})$ is also the first-order derivative of the convolutional likelihood $l(x_{ij}^{(K)})$ with respect to $\hat{x}_{ij}^{(k-1)}$ at layer $X_k$, where all elements $\hat{x}_{ij}^{(k-1)}$ represent the layer $X_k$ as a variable.
$$\nabla_{\theta^{(k)}} l\left(x_{ij}^{(k)}\right) = \frac{dl\left(x_{ij}^{(K)}\right)}{d\theta^{(k)}} = \frac{dl\left(x_{ij}^{(K)}\right)}{d\hat{x}_{ij}^{(k-1)}}$$
Although there is no doubt about the preeminence of the convolutional neural network (CNN) in training on image data, it is not necessarily meaningful to extend CNN to other high-dimensional data that are not images, because the CNN filter is designed to extract image features. Moreover, data are changed or distorted after passing through CNN, so the original data with their complete meaning are not kept intact within the CNN methodology. Therefore, this research focuses on evaluating the hypothesis “whether matrix neural network is the alternative of convolutional neural network”, and image classification by CNN and by the matrix neural network, called deep image classification, is used to test the hypothesis. The next section mentions some other works related to deep image classification, and the later section describes the main methodology of this research, which focuses on the matrix neural network.

3. Matrix Neural Network

The common representation of the artificial neural network (ANN) is the feedforward network (FFN), in which the input layer is fed forward so as to evaluate the output layer. If an FFN has many hidden layers between the input layer and the output layer, it is called a deep neural network (DNN). As usual, every DNN layer is coded as a vector, which raises the so-called boom problem of a huge number of parameters. For example, if two successive layers are coded as an m*n-element vector and a p*q-element vector, then there are m*n*p*q parametric weights, which is indeed a problem of quadratic complexity. The matrix neural network (MNN) aims to solve the boom problem by coding layers as matrices and coding parameters as matrices too. The MNN mentioned here is an FFN with matrix layers, represented by a series of matrix layers $X_0, X_1, X_2, \ldots, X_K$ and called a K-layer FFN, as follows:
$$X_k = f\left(\hat{X}_{k-1} = U_k X_{k-1} V_k + \Theta_k\right), \quad k = \overline{1,K}$$
Each matrix layer $X_k$ has three parameters: the parametric weight matrix $U_k$, the parametric weight matrix $V_k$, and the parametric bias matrix $\Theta_k$. This solves the boom problem of a huge number of parameters because the size of each weight matrix is only a product of two layer dimensions: when $X_{k-1}$ is m x n and $X_k$ is p x q, $U_k$ is p x m and $V_k$ is n x q. To emphasize the particular layer $X_k$, the activation function $f(\hat{X}_{k-1})$ is denoted $f_{X_k}(\hat{X}_{k-1})$. Training MNN means estimating the parameters $U_k$, $V_k$, and $\Theta_k$ by maximizing the likelihood function $l(X_K)$ at the last output layer $X_K$, which is defined as the negative of the squared Frobenius norm given the real output $X_K'$.
$$l(X_K) = -\frac{1}{2}\left\| X_K' - X_K \right\|^2$$
where,
$$\left\| X_K' - X_K \right\|^2 = \sum_{i}\sum_{j}\left(x'_{Kij} - x_{Kij}\right)^2, \qquad l'(X_K) = \nabla_{X_K} l(X_K) = X_K' - X_K$$
So that:
$$U_k = \underset{U_k}{\mathrm{argmax}}\; l(X_K), \quad V_k = \underset{V_k}{\mathrm{argmax}}\; l(X_K), \quad \Theta_k = \underset{\Theta_k}{\mathrm{argmax}}\; l(X_K), \quad k = K, K-1, \ldots, 1$$
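The matrix propagation rule, the Frobenius-norm likelihood, and the parameter saving can be illustrated as follows. The sigmoid activation, the layer sizes, and the function names are assumptions of this sketch. The parameter counts compare one MNN layer (U is p x m, V is n x q, Θ is p x q) with one fully connected layer between the flattened layers.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mnn_forward(X0, params, f=sigmoid):
    """X_k = f(U_k X_{k-1} V_k + Theta_k) for each (U_k, V_k, Theta_k)."""
    layers = [X0]
    for U, V, Theta in params:
        layers.append(f(U @ layers[-1] @ V + Theta))
    return layers

def likelihood(X_K, X_target):
    """l(X_K) = -1/2 * squared Frobenius norm of (X_K' - X_K)."""
    return -0.5 * np.sum((X_target - X_K) ** 2)

def mnn_param_count(m, n, p, q):
    return p * m + n * q + p * q        # U, V, and Theta

def ann_param_count(m, n, p, q):
    return (p * q) * (m * n) + p * q    # weight matrix and bias vector

rng = np.random.default_rng(0)
X0 = rng.normal(size=(4, 5))            # hypothetical 4x5 input layer
params = [(rng.normal(size=(3, 4)), rng.normal(size=(5, 2)), np.zeros((3, 2)))]
layers = mnn_forward(X0, params)
print(layers[-1].shape)
print(mnn_param_count(28, 28, 14, 14), ann_param_count(28, 28, 14, 14))
```

For a 28x28 layer mapped to a 14x14 layer, the MNN layer needs 980 parameters, whereas the vector-coded layer needs 153860.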
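The optimization problem is solved iteratively by association of the stochastic gradient descent (SGD) algorithm and the backpropagation algorithm. In particular, the parametric weight matrices $U_k$, $V_k$ and the parametric bias matrix $\Theta_k$ at the kth layer are estimated iteratively by SGD as follows:
$$U_k = U_k + \gamma \nabla_{U_k} l, \quad V_k = V_k + \gamma \nabla_{V_k} l, \quad \Theta_k = \Theta_k + \gamma \nabla_{\Theta_k} l, \quad k = \overline{1,K}$$
where γ (0 < γ ≤ 1) is the learning rate and $\nabla_{U_k} l$, $\nabla_{V_k} l$, and $\nabla_{\Theta_k} l$ are the gradients, or first-order derivatives, of the likelihood $l(X_K)$ with respect to $U_k$, $V_k$, and $\Theta_k$, respectively. Therefore, training MNN essentially means calculating the parametric weight gradients $\nabla_{U_k} l$, $\nabla_{V_k} l$ and the parametric bias gradient $\nabla_{\Theta_k} l$. Note:
$$\nabla_{U_k} l = \nabla_{U_k} l(X_K), \quad \nabla_{V_k} l = \nabla_{V_k} l(X_K), \quad \nabla_{\Theta_k} l = \nabla_{\Theta_k} l(X_K), \quad k = \overline{1,K}$$
Because these gradients are high-dimensional tensors, the vectorization technique is applied to calculate them, so that matrix gradients are equivalent to vector gradients as follows:
$$\nabla_{U_k} l(X_K) \equiv \nabla_{U_k} l\left(\mathrm{vec}(X_K)\right), \quad \nabla_{V_k} l(X_K) \equiv \nabla_{V_k} l\left(\mathrm{vec}(X_K)\right), \quad \nabla_{\Theta_k} l(X_K) \equiv \nabla_{\Theta_k} l\left(\mathrm{vec}(X_K)\right), \quad k = \overline{1,K}$$
As a convention, we denote:
$$\nabla_{U_k} l \equiv \nabla_{U_k} \mathrm{vec}(l), \quad \nabla_{V_k} l \equiv \nabla_{V_k} \mathrm{vec}(l), \quad \nabla_{\Theta_k} l \equiv \nabla_{\Theta_k} \mathrm{vec}(l), \quad k = \overline{1,K}$$
where the notation vec(.) denotes the vectorization operator, which converts a matrix into a single column vector without information loss. Let $\nabla_{X_K} l(X_K)$ be the gradient of the likelihood with respect to $X_K$, which is determined as follows:
$$\nabla_{X_K} l\left(\mathrm{vec}(X_K)\right) = \mathrm{vec}\left(\nabla_{X_K} l(X_K)\right) = \mathrm{vec}\left(X_K' - X_K\right)$$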
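The differential dl of the likelihood with respect to $X_{K-1}$ within vectorization is:
$$d\,\mathrm{vec}\left(l(X_K)\right) = dl(X_K) = dl\left(\mathrm{vec}(X_K)\right)$$
(Due to $\left\| X_K' - X_K \right\|^2 = \left\| \mathrm{vec}(X_K') - \mathrm{vec}(X_K) \right\|^2 = \left\| \mathrm{vec}(X_K' - X_K) \right\|^2$)
$$= \mathrm{tr}\left(\nabla l\left(\mathrm{vec}(X_K)\right)^T d\,\mathrm{vec}(X_K)\right)$$
(This is the important property relating the gradient and the trace operator)
$$= \mathrm{tr}\left(\left(\mathrm{vec}(X_K') - \mathrm{vec}(X_K)\right)^T d\,\mathrm{vec}(X_K)\right)$$
$$= \mathrm{tr}\left(\mathrm{vec}\left(X_K' - X_K\right)^T \mathrm{vec}(dX_K)\right) = \mathrm{tr}\left(\mathrm{vec}\left(X_K' - X_K\right)^T \mathrm{vec}\left(d f_{X_K}(\hat{X}_{K-1})\right)\right)$$
$$= \mathrm{tr}\left(\mathrm{vec}\left(X_K' - X_K\right)^T d\,\mathrm{vec}\left(f_{X_K}(\hat{X}_{K-1})\right)\right)$$
Note, the superscript “T” denotes the transposition operator of vectors and matrices, whereas tr(.) denotes the trace operator, which takes the sum of the diagonal elements of a square matrix. Moreover, we can denote:
$$dl = d\,\mathrm{vec}(l) = d\,\mathrm{vec}\left(l(X_K)\right)$$
Due to:
$$d\,\mathrm{vec}\left(f_{X_K}(\hat{X}_{K-1})\right) = d\,\mathrm{vec}\left(f_{X_K}(U_K X_{K-1} V_K + \Theta_K)\right) = d f_{X_K}\left(\mathrm{vec}(U_K X_{K-1} V_K + \Theta_K)\right)$$
$$= f'_{X_K}\left(\mathrm{vec}(\hat{X}_{K-1})\right)\left(V_K^T \otimes U_K\right)\mathrm{vec}(dX_{K-1}) = f'_{X_K}\left(\mathrm{vec}(\hat{X}_{K-1})\right)\left(V_K^T \otimes U_K\right) d\,\mathrm{vec}(X_{K-1})$$
Which implies:
$$d\,\mathrm{vec}\left(l(X_K)\right) = \mathrm{tr}\left(\mathrm{vec}\left(X_K' - X_K\right)^T f'_{X_K}\left(\mathrm{vec}(\hat{X}_{K-1})\right)\left(V_K^T \otimes U_K\right) d\,\mathrm{vec}(X_{K-1})\right)$$
Note, the notation ⊗ denotes the Kronecker product, and $f'_{X_K}(\mathrm{vec}(\hat{X}_{K-1}))$ is the derivative of the activation function $f_{X_K}(\mathrm{vec}(\hat{X}_{K-1}))$ with respect to $\mathrm{vec}(\hat{X}_{K-1})$. The derivation also uses the following very important property of the differential of a vec(.) variable: given a matrix variable X and two constant matrices A and B, we have:
$$d\,\mathrm{vec}(AXB) = \mathrm{vec}\left(d(AXB)\right) = \left(B^T \otimes A\right)\mathrm{vec}(dX) = \left(B^T \otimes A\right) d\,\mathrm{vec}(X)$$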
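The vectorization property can be verified numerically. The only assumption needed is that vec(.) stacks the columns of a matrix (column-major order), which is what `order="F"` selects in NumPy; the matrices below are random test data.

```python
import numpy as np

def vec(M):
    """Stack the columns of M into one vector (column-major order)."""
    return M.reshape(-1, order="F")

rng = np.random.default_rng(1)
A = rng.normal(size=(3, 4))
X = rng.normal(size=(4, 5))
B = rng.normal(size=(5, 2))

lhs = vec(A @ X @ B)            # vec(AXB)
rhs = np.kron(B.T, A) @ vec(X)  # (B^T ⊗ A) vec(X)
print(np.allclose(lhs, rhs))
```

With row-major flattening (NumPy's default) the identity would take the form (A ⊗ B^T) instead, so the ordering convention matters when implementing the gradients.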
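On the other hand, we have:
$$d\,\mathrm{vec}\left(f(\hat{X}_{K-1})\right) = d\,\mathrm{vec}\left(f_{X_K}(U_K X_{K-1} V_K + \Theta_K)\right) = f'_{X_K}\left(\mathrm{vec}(\hat{X}_{K-1})\right) d\,\mathrm{vec}(\Theta_K)$$
Such that the likelihood differential dl with respect to the last bias parameter $\Theta_K$ is:
$$d\,\mathrm{vec}\left(l(X_K)\right) = \mathrm{tr}\left(\mathrm{vec}\left(X_K' - X_K\right)^T f'_{X_K}\left(\mathrm{vec}(\hat{X}_{K-1})\right) d\,\mathrm{vec}(\Theta_K)\right)$$
In other words, the gradient of the last bias parameter $\Theta_K$ is:
$$\nabla_{\Theta_K} \mathrm{vec}(l) = \frac{d\,\mathrm{vec}\left(l(X_K)\right)}{d\,\mathrm{vec}(\Theta_K)} = f'_{X_K}\left(\mathrm{vec}(\hat{X}_{K-1})\right)^T \mathrm{vec}\left(X_K' - X_K\right)$$
(This is the important property relating the gradient and the trace operator)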
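Let $I_{r(U_K)}$ be the identity matrix whose dimension is $r(U_K) \times r(U_K)$, where $r(U_K)$ is the number of rows of $U_K$. We also have:
$$d\,\mathrm{vec}\left(f(\hat{X}_{K-1})\right) = d\,\mathrm{vec}\left(f_{X_K}(U_K X_{K-1} V_K + \Theta_K)\right) = f'_{X_K}\left(\mathrm{vec}(\hat{X}_{K-1})\right)\left(\left(X_{K-1} V_K\right)^T \otimes I_{r(U_K)}\right) d\,\mathrm{vec}(U_K)$$
Such that the likelihood differential dl with respect to the last weight parameter $U_K$ is:
$$d\,\mathrm{vec}\left(l(X_K)\right) = \mathrm{tr}\left(\mathrm{vec}\left(X_K' - X_K\right)^T f'_{X_K}\left(\mathrm{vec}(\hat{X}_{K-1})\right)\left(\left(X_{K-1} V_K\right)^T \otimes I_{r(U_K)}\right) d\,\mathrm{vec}(U_K)\right)$$
In other words, the gradient of the last weight parameter $U_K$ is:
$$\nabla_{U_K} \mathrm{vec}(l) = \frac{d\,\mathrm{vec}\left(l(X_K)\right)}{d\,\mathrm{vec}(U_K)} = \left(\left(X_{K-1} V_K\right)^T \otimes I_{r(U_K)}\right)^T f'_{X_K}\left(\mathrm{vec}(\hat{X}_{K-1})\right)^T \mathrm{vec}\left(X_K' - X_K\right) = \left(\left(X_{K-1} V_K\right)^T \otimes I_{r(U_K)}\right)^T \nabla_{\Theta_K} \mathrm{vec}(l)$$
Let $I_{c(V_K)}$ be the identity matrix whose dimension is $c(V_K) \times c(V_K)$, where $c(V_K)$ is the number of columns of $V_K$. Similarly, the likelihood differential dl with respect to the last weight parameter $V_K$ is:
$$d\,\mathrm{vec}\left(l(X_K)\right) = \mathrm{tr}\left(\mathrm{vec}\left(X_K' - X_K\right)^T f'_{X_K}\left(\mathrm{vec}(\hat{X}_{K-1})\right)\left(I_{c(V_K)} \otimes \left(U_K X_{K-1}\right)\right) d\,\mathrm{vec}(V_K)\right)$$
In other words, the gradient of the last weight parameter $V_K$ is:
$$\nabla_{V_K} \mathrm{vec}(l) = \frac{d\,\mathrm{vec}\left(l(X_K)\right)}{d\,\mathrm{vec}(V_K)} = \left(I_{c(V_K)} \otimes \left(U_K X_{K-1}\right)\right)^T f'_{X_K}\left(\mathrm{vec}(\hat{X}_{K-1})\right)^T \mathrm{vec}\left(X_K' - X_K\right) = \left(I_{c(V_K)} \otimes \left(U_K X_{K-1}\right)\right)^T \nabla_{\Theta_K} \mathrm{vec}(l)$$
Moreover, the likelihood differential dl is rewritten according to the bias gradient as follows:
$$d\,\mathrm{vec}\left(l(X_K)\right) = \mathrm{tr}\left(\mathrm{vec}\left(X_K' - X_K\right)^T f'_{X_K}\left(\mathrm{vec}(\hat{X}_{K-1})\right)\left(V_K^T \otimes U_K\right) d\,\mathrm{vec}(X_{K-1})\right) = \mathrm{tr}\left(\left(\nabla_{\Theta_K} \mathrm{vec}(l)\right)^T \left(V_K^T \otimes U_K\right) d\,\mathrm{vec}(X_{K-1})\right)$$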
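By generalizing the likelihood differential dl to any kth layer for all 1 ≤ k < K within vectorization, we obtain:
$$\nabla_{U_k} \mathrm{vec}(l) = \left(\left(X_{k-1} V_k\right)^T \otimes I_{r(U_k)}\right)^T \nabla_{\Theta_k} \mathrm{vec}(l)$$
$$\nabla_{V_k} \mathrm{vec}(l) = \left(I_{c(V_k)} \otimes \left(U_k X_{k-1}\right)\right)^T \nabla_{\Theta_k} \mathrm{vec}(l)$$
$$d\,\mathrm{vec}(l) = \mathrm{tr}\left(\left(\nabla_{\Theta_{k+1}} \mathrm{vec}(l)\right)^T \left(V_{k+1}^T \otimes U_{k+1}\right) d\,\mathrm{vec}(X_k)\right)$$
The differential of $\mathrm{vec}(X_k)$ is always determined with respect to any bias parameter $\Theta_k$ for all 1 ≤ k < K:
$$d\,\mathrm{vec}(X_k) = d\,\mathrm{vec}\left(f(\hat{X}_{k-1})\right) = f'_{X_k}\left(\mathrm{vec}(\hat{X}_{k-1})\right) d\,\mathrm{vec}(\Theta_k)$$
Such that the likelihood differential dl is rewritten with respect to any bias parameter $\Theta_k$ for all 1 ≤ k < K.
$$d\,\mathrm{vec}\left(l(X_K)\right) = \mathrm{tr}\left(\left(\nabla_{\Theta_{k+1}} \mathrm{vec}(l)\right)^T \left(V_{k+1}^T \otimes U_{k+1}\right) f'_{X_k}\left(\mathrm{vec}(\hat{X}_{k-1})\right) d\,\mathrm{vec}(\Theta_k)\right)$$
This totally determines the gradient of $\Theta_k$ for all 1 ≤ k < K.
$$\nabla_{\Theta_k} \mathrm{vec}(l) = \frac{d\,\mathrm{vec}\left(l(X_K)\right)}{d\,\mathrm{vec}(\Theta_k)} = f'_{X_k}\left(\mathrm{vec}(\hat{X}_{k-1})\right)^T \left(V_{k+1}^T \otimes U_{k+1}\right)^T \nabla_{\Theta_{k+1}} \mathrm{vec}(l)$$
As a result of the association of the stochastic gradient descent (SGD) algorithm and the backpropagation algorithm for training the matrix neural network (MNN), all gradients $\nabla_{U_k} \mathrm{vec}(l)$, $\nabla_{V_k} \mathrm{vec}(l)$, and $\nabla_{\Theta_k} \mathrm{vec}(l)$ of the Frobenius norm likelihood with respect to the parameters $U_k$, $V_k$, and $\Theta_k$, respectively, within the vectorization technique for all 1 ≤ k ≤ K are calculated as follows:
$$\nabla_{U_k} \mathrm{vec}(l) = \left(\left(X_{k-1} V_k\right)^T \otimes I_{r(U_k)}\right)^T \nabla_{\Theta_k} \mathrm{vec}(l)$$
$$\nabla_{V_k} \mathrm{vec}(l) = \left(I_{c(V_k)} \otimes \left(U_k X_{k-1}\right)\right)^T \nabla_{\Theta_k} \mathrm{vec}(l)$$
$$\nabla_{\Theta_k} \mathrm{vec}(l) = f'_{X_k}\left(\mathrm{vec}(\hat{X}_{k-1})\right)^T \left(V_{k+1}^T \otimes U_{k+1}\right)^T \nabla_{\Theta_{k+1}} \mathrm{vec}(l)$$
$$\nabla_{\Theta_K} \mathrm{vec}(l) = f'_{X_K}\left(\mathrm{vec}(\hat{X}_{K-1})\right)^T \mathrm{vec}\left(X_K' - X_K\right)$$
Such that:
$$\mathrm{vec}(U_k) = \mathrm{vec}(U_k) + \gamma \nabla_{U_k} \mathrm{vec}(l), \quad \mathrm{vec}(V_k) = \mathrm{vec}(V_k) + \gamma \nabla_{V_k} \mathrm{vec}(l), \quad \mathrm{vec}(\Theta_k) = \mathrm{vec}(\Theta_k) + \gamma \nabla_{\Theta_k} \mathrm{vec}(l)$$
Note that the functions r(A) and c(A) return the number of rows and the number of columns of matrix A. Therefore, $I_{r(U_k)}$ is the identity matrix whose dimension is $r(U_k) \times r(U_k)$ and $I_{c(V_k)}$ is the identity matrix whose dimension is $c(V_k) \times c(V_k)$. If $U_k$ is ignored, it is the identity matrix $I_{r(X_k)}$. If $V_k$ is ignored, it is the identity matrix $I_{c(X_k)}$. MNN is not new; please read the research paper “Matrix Neural Networks” (Gao, Guo, & Wang, 2016) to learn more about MNN, which is often called matrix-valued neural network in the literature. The contribution of this research is to reproduce the succinct and practical equations above for training MNN.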
In practical calculation, the estimation equations are not significantly simpler than traditional ANN training because of vectorization operator and Kronecker product. Fortunately, wise-multiplication is applied into reducing computational cost as follows:
U k l = v e c 1 X k 1 V k T I r U k T v e c Θ k l V k l = v e c 1 I c V k U k X k 1 T v e c Θ k l Θ k l = f X k ' X ^ k 1 v e c 1 V k + 1 T U k + 1 T v e c Θ k + 1 l Θ K l = f X K ' X ^ K 1 X K ' X K
Such that:
U k = U k + γ U k l V k = V k + γ V k l Θ k = Θ k + γ Θ k l
where the notation $\odot$ denotes element-wise multiplication and the notation vec–1(.) denotes the inverse vectorization operator, which converts a single column vector back into the original matrix. Note that, given two vectors (matrices) of the same dimension, element-wise multiplication produces a new vector (matrix) in which every element is the product of the two elements that share the same index in the two operands. Please note that $f'_{X_k}(\hat{X}_{k-1})$ is not the derivative of the activation function f(.) with respect to the matrix $\hat{X}_{k-1}$ as a whole; rather, it is the matrix obtained by taking the derivative of f(.) at every element of $\hat{X}_{k-1}$, which means that:
$$f'_{X_k}\left(\hat{X}_{k-1}\right) = f'\left(\hat{X}\right) = \begin{pmatrix}
f'(\hat{x}_{11}) & f'(\hat{x}_{12}) & \cdots & f'(\hat{x}_{1n})\\
f'(\hat{x}_{21}) & f'(\hat{x}_{22}) & \cdots & f'(\hat{x}_{2n})\\
\vdots & \vdots & \ddots & \vdots\\
f'(\hat{x}_{m1}) & f'(\hat{x}_{m2}) & \cdots & f'(\hat{x}_{mn})
\end{pmatrix}$$
For example, when f(.) is the sigmoid function $f(\hat{x}_{ij}) = \frac{1}{1 + e^{-\hat{x}_{ij}}}$, its derivative at $\hat{x}_{ij}$ is:
$$f'(\hat{x}_{ij}) = f(\hat{x}_{ij})\left(1 - f(\hat{x}_{ij})\right)$$
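As a hedged illustration of the element-wise form, the following NumPy sketch computes the three gradients for a single last layer (k = K) with sigmoid activation. The function name `mnn_last_layer_grads` and all matrix sizes are assumptions made for this example only. It relies on the standard identity vec(AXB) = (B^T ⊗ A) vec(X) with column-major vectorization, under which the Kronecker expressions reduce to ordinary matrix products; the explicit Kronecker forms are computed as well for comparison.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mnn_last_layer_grads(U, V, Theta, X_prev, X_target):
    """Gradients for one matrix layer X = f(U @ X_prev @ V + Theta), f = sigmoid.

    Hypothetical helper: it follows the sign convention of the text, where the
    core last bias is (X_target - X) and parameters are updated with "+ gamma".
    """
    X_hat = U @ X_prev @ V + Theta
    X_out = sigmoid(X_hat)
    f_prime = X_out * (1.0 - X_out)            # element-wise derivative matrix f'(X_hat)
    g_theta = f_prime * (X_target - X_out)     # element-wise product with the core last bias
    g_U = g_theta @ (X_prev @ V).T             # Kronecker form collapsed to a matrix product
    g_V = (U @ X_prev).T @ g_theta
    return g_U, g_V, g_theta

# Small random example (sizes are arbitrary assumptions).
rng = np.random.default_rng(0)
U = rng.normal(size=(3, 4))
X_prev = rng.normal(size=(4, 5))
V = rng.normal(size=(5, 2))
Theta = rng.normal(size=(3, 2))
X_target = rng.uniform(size=(3, 2))

g_U, g_V, g_theta = mnn_last_layer_grads(U, V, Theta, X_prev, X_target)

# Same gradients through the explicit Kronecker products (column-major vec).
vec_g = g_theta.reshape(-1, order="F")
g_U_kron = (np.kron((X_prev @ V).T, np.eye(3)).T @ vec_g).reshape(U.shape, order="F")
g_V_kron = (np.kron(np.eye(2), U @ X_prev).T @ vec_g).reshape(V.shape, order="F")

# One update step with learning rate gamma, as in the update equations.
gamma = 0.1
U_new, V_new, Theta_new = U + gamma * g_U, V + gamma * g_V, Theta + gamma * g_theta
```

The matrix-product form avoids building the Kronecker matrices, whose size grows with the product of the layer dimensions.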
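Moreover, the Kronecker product is easily optimized in calculation by programming technique. For example, given:
$$\nabla_{U_k} l = \mathrm{vec}^{-1}\left(\left(\left(X_{k-1}V_k\right)^T \otimes I_{r(U_k)}\right)^T \mathrm{vec}\left(\nabla_{\Theta_k} l\right)\right)$$
It is necessary to illustrate how to calculate:
$$\mathrm{vec}^{-1}\left(\left(A^T \otimes B\right)^T \mathrm{vec}(C)\right)$$
where A, B, and C are the following matrices.
$$A = X_{k-1}V_k, \quad B = I_{r(U_k)}, \quad C = \nabla_{\Theta_k} l$$
Indeed, we have:
$$\mathrm{vec}^{-1}\left(\left(A^T \otimes B\right)^T \mathrm{vec}(C)\right) = \mathrm{vec}^{-1}\left(\left(A \otimes B^T\right) \mathrm{vec}(C)\right)$$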
Suppose:
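$$A = \left(a_{ij}\right)_{m \times n} = \begin{pmatrix} a_{11} & \cdots & a_{1n}\\ \vdots & \ddots & \vdots\\ a_{m1} & \cdots & a_{mn} \end{pmatrix}, \quad B = \left(b_{ij}\right)_{p \times q} = \begin{pmatrix} b_{11} & \cdots & b_{1q}\\ \vdots & \ddots & \vdots\\ b_{p1} & \cdots & b_{pq} \end{pmatrix}$$
Let Pi be the block matrix obtained by multiplying the matrix BT by every element of the ith row Ai of matrix A:
$$P_i = A_i \otimes B^T = \left(a_{i1}B^T, \ldots, a_{in}B^T\right) = \begin{pmatrix}
a_{i1}b_{11} & \cdots & a_{i1}b_{p1} & \cdots & a_{in}b_{11} & \cdots & a_{in}b_{p1}\\
\vdots & & \vdots & & \vdots & & \vdots\\
a_{i1}b_{1q} & \cdots & a_{i1}b_{pq} & \cdots & a_{in}b_{1q} & \cdots & a_{in}b_{pq}
\end{pmatrix}$$
The dimension of Pi is q x (n*p). Each Pi is multiplied by vec(C), which is an (n*p) x 1 vector, to produce a column vector Qi of dimension q x 1.
$$Q_i = P_i\, \mathrm{vec}(C)$$
The final result is the side-by-side concatenation of the m column vectors Qi, which is the q x m matrix as follows:
$$\left(Q_1, Q_2, \ldots, Q_m\right) = \mathrm{vec}^{-1}\left(\left(A^T \otimes B\right)^T \mathrm{vec}(C)\right) = \mathrm{vec}^{-1}\left(\left(\left(X_{k-1}V_k\right)^T \otimes I_{r(U_k)}\right)^T \mathrm{vec}\left(\nabla_{\Theta_k} l\right)\right) = \nabla_{U_k} l$$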
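The block-wise calculation can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions: C is taken to be a p x n matrix so that vec(C) has length n*p, vectorization is column-major, and the three function names are hypothetical. The third function uses the standard identity (A ⊗ B^T) vec(C) = vec(B^T C A^T), which avoids forming any Kronecker product.

```python
import numpy as np

def kron_apply_direct(A, B, C):
    """vec^-1((A kron B^T) vec(C)) with the full Kronecker matrix."""
    m, n = A.shape
    p, q = B.shape
    vec_C = C.reshape(-1, order="F")           # column-major vec, length n*p
    out = np.kron(A, B.T) @ vec_C              # length m*q
    return out.reshape((q, m), order="F")      # inverse vec into a q x m matrix

def kron_apply_blockwise(A, B, C):
    """Same result built column by column: Q_i = (A_i kron B^T) vec(C)."""
    m, _ = A.shape
    vec_C = C.reshape(-1, order="F")
    columns = []
    for i in range(m):
        P_i = np.kron(A[i:i + 1, :], B.T)      # q x (n*p) block for the ith row of A
        columns.append(P_i @ vec_C)            # Q_i, a vector of length q
    return np.stack(columns, axis=1)           # columns Q_1..Q_m side by side

def kron_apply_fast(A, B, C):
    """Same result without any Kronecker product: B^T C A^T."""
    return B.T @ C @ A.T

rng = np.random.default_rng(1)
A = rng.normal(size=(3, 4))                    # m x n
B = rng.normal(size=(2, 5))                    # p x q
C = rng.normal(size=(2, 4))                    # p x n (assumed shape)

R_direct = kron_apply_direct(A, B, C)
R_block = kron_apply_blockwise(A, B, C)
R_fast = kron_apply_fast(A, B, C)
```

With B equal to an identity matrix, as in the gradient with respect to Uk, the fast form reduces to C A^T.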
The gradient $\nabla^{vec}_{\Theta_k} l$ of the bias Θk at the layer Xk, which is called the kth bias, is expanded as follows:
$$b^{vec}_{X_k}(X_k) = \nabla^{vec}_{\Theta_k} l = \left(\prod_{l=k}^{K-1} f'_{X_l}\left(\mathrm{vec}\left(\hat{X}_{l-1}\right)\right)^T \left(V_{l+1}^T \otimes U_{l+1}\right)^T\right) f'_{X_K}\left(\mathrm{vec}\left(\hat{X}_{K-1}\right)\right)^T b^{vec}_{X_K}(X_K)$$
where the quantity $b^{vec}_{X_K}(X_K)$ is called the core last bias, which was calculated from the likelihood based on the squared Frobenius norm.
$$b^{vec}_{X_K}(X_K) = X'_K - X_K$$
If MNN is applied to the aforementioned classification task in order to derive an MNN classifier, it is simpler to modify the core last bias so as to minimize the cross-entropy loss function, as follows:
$$b^{vec}_{X_K}(X_K) = \left(\sum_{i=1}^{n} p_i\left(\delta_{ij} - \hat{p}_j\right)\right)^T_{j=\overline{1,n}}$$
where,
$$\delta_{ij} = \begin{cases} 1 & \text{if } i = j\\ 0 & \text{if } i \neq j \end{cases}$$
Note that the vector p = (p1, p2,…, pn)T is the real class probability at the last layer XK = xK, and $\hat{p}_i$ is the predicted probability of class i such that
$$\hat{p}_i = \mathrm{softmax}(x_{Ki}) = \frac{\exp(x_{Ki})}{\sum_{l=1}^{n} \exp(x_{Kl})}$$
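A short NumPy sketch of the cross-entropy core last bias follows. The function names are hypothetical and the example values are arbitrary. The sketch evaluates the sum over i literally and, as a side observation, the result coincides with p minus the soft-max output whenever the real probabilities p sum to one.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))      # shift by the maximum for numerical stability
    return z / z.sum()

def cross_entropy_last_bias(p, x_K):
    """b_j = sum_i p_i * (delta_ij - p_hat_j) for j = 1..n (hypothetical helper)."""
    p_hat = softmax(x_K)
    n = len(p)
    delta = np.eye(n)              # Kronecker delta as an identity matrix
    return np.array([
        sum(p[i] * (delta[i, j] - p_hat[j]) for i in range(n))
        for j in range(n)
    ])

p = np.array([0.0, 1.0, 0.0])      # real class probabilities (one-hot example)
x_K = np.array([0.2, -1.0, 0.7])   # last-layer values before soft-max
bias = cross_entropy_last_bias(p, x_K)
```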
Note that it is entirely possible to remove the complexity of the Kronecker product and the vectorization technique by combining programming technique with the fact that the activation function acts independently on every element of a matrix. Moreover, the kth bias $b^{vec}_{X_k}(X_k) = \nabla^{vec}_{\Theta_k} l$, which is also called backward bias or backward error at the kth layer, is the cornerstone of the association of the backpropagation algorithm and the stochastic gradient descent (SGD) algorithm because the gradient (bias) $\nabla^{vec}_{\Theta_k} l$ is also the first-order derivative of the training likelihood l(XK) with respect to $\hat{X}_{k-1}$ at the layer Xk.
$$\nabla^{vec}_{\Theta_k} l = \frac{dl(X_K)}{d\,\mathrm{vec}(\Theta_k)} = \frac{dl(X_K)}{d\,\mathrm{vec}\left(\hat{X}_{k-1}\right)}$$
Transformer, developed by Vaswani et al. (Vaswani, et al., 2017), was invented to improve deep neural network (DNN) so that it discovers the self-structure, or internal feature, of the input of DNN. Precisely, given input X = X0 of DNN, the self-structure of X, which is the internal structure of X, is known as self-attention or attention in the context of transformer. In other words, transformer extracts such attention, denoted A = A(X), so that DNN processes the finer feature A instead of the coarser input X. Extracting attention is indeed self-supervised learning, which is intermediate between supervised learning and unsupervised learning. In practice, transformer is associated with feedforward network (FFN) in order to establish a complex transformer, which provides the self-supervised learning mechanism to artificial neural network (ANN). This is actually a revolution in the domain of artificial intelligence (AI) because ANN then fully supports four large subjects of machine learning: supervised learning, self-supervised learning, unsupervised learning, and reinforcement learning. Note that ANN supports Q-learning, which is a popular reinforcement learning method. As a convention, the complex transformer, which is the association of transformer and FFN where transformer is the first part and FFN is the second part, is simply called transformer. On the other hand, a transformer built only from its core “attention” without FFN is also a different methodology from traditional ANN because the propagation rule of attentional transformer $A = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$, mentioned later, is an improvement of the propagation rule of ANN $X_k = f\left(\hat{X}_{k-1}\right)$ where $\hat{X}_{k-1} = U_k X_{k-1} V_k + \Theta_k$. Actually, the transformer-based propagation rule is more complicated and clever than the ANN-based propagation rule, with note that the soft-max function $\mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)$ also acts as a squashing activation function.
Therefore, it is entirely possible to connect many attentions together as attentional layers in the same way that ANN connects its neuron layers. The reason that FFN is connected to the attention A(X) to constitute a complex transformer is to fine-tune the attention A(X); this second viewpoint is opposite to the aforementioned first viewpoint, in which the attention A(X) fine-tunes the coarse input data X into the finer feature A(X).
The essence of convolutional neural network (CNN) is to extract features of an image, whereas the essence of transformer-based attention discovery is to extract features of general data in matrix form. That is the reason for the question “Is matrix neural network the alternative of convolutional neural network?”, which is the main hypothesis of this research and which also asks whether attention extracted by transformer can be the alternative of the convolutional layer produced by CNN, given that attention of transformer is always based on matrix. The first hazard problem is that CNN processes convolutional layers fast because its filter operates with a small kernel, whereas the aforementioned boom problem of ANN arises from flattening high-dimensional matrix-form data such as an image into a vector-form input layer, which causes the aforementioned problem of quadratic complexity. Fortunately, the first hazard problem is solved exactly by matrix neural network (MNN), whose parameters are in matrix form. As a result, using MNN as the FFN of transformer is an application of MNN to transformer, so the research focuses on the effectiveness of MNN in comparison with CNN in order to evaluate the hypothesis “Is matrix neural network the alternative of convolutional neural network?”, with note that convergence speed is also considered when evaluating this hypothesis. In any case, it is necessary to sketch transformer and its mechanism for extracting attention.
Given input data X = X0 and three weight matrices WQ, WK, and WV which are parameters of transformer, then query matrix Q, key matrix K, and value matrix V are specified as follows:
$$Q = XW_Q, \quad K = XW_K, \quad V = XW_V$$
Query matrix Q represents what the system needs to search for and key matrix K represents the information that sequence X is expected to contain, whereas value matrix V represents the actual content of sequence X (Gemini 2025). As usual, the three parametric matrices WQ, WK, and WV are called query weight matrix, key weight matrix, and value weight matrix, respectively. Attention A = A(X) is calculated from the product of Q, K, and V in association with the soft-max function, as follows:
$$A = A(X) = \mathrm{Attention}(X) = \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
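The attention formula can be sketched in a few lines of NumPy. This is a minimal single-head illustration; the function names and the matrix sizes are assumptions chosen for the example, and the soft-max is applied row by row.

```python
import numpy as np

def softmax_rows(R):
    """Row-wise soft-max with a max shift for numerical stability."""
    Z = np.exp(R - R.max(axis=1, keepdims=True))
    return Z / Z.sum(axis=1, keepdims=True)

def attention(X, W_Q, W_K, W_V):
    """Returns the attention A and the row-stochastic weight matrix P."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d_k = W_K.shape[1]
    P = softmax_rows(Q @ K.T / np.sqrt(d_k))   # m x m similarity probabilities
    return P @ V, P                            # A has shape m x d_v

rng = np.random.default_rng(2)
m, d_m, d_k, d_v = 4, 6, 3, 2                  # arbitrary example sizes
X = rng.normal(size=(m, d_m))
W_Q = rng.normal(size=(d_m, d_k))
W_K = rng.normal(size=(d_m, d_k))
W_V = rng.normal(size=(d_m, d_v))

A, P = attention(X, W_Q, W_K, W_V)
```

Each row of P sums to one, so each row of A is a weighted average of the rows of V.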
Attention is learned by the association of stochastic gradient descent (SGD) and the backpropagation algorithm for maximizing the likelihood function, which in turn is the squared Frobenius norm of the distance between the computed attention A and the environmental real attention A’.
$$L(A) = \frac{1}{2}\left\|A' - A\right\|^2 = \frac{1}{2}\sum_i \sum_j \left(a'_{ij} - a_{ij}\right)^2$$
Attention learning is the core of learning transformer, and the essence of attention learning is to estimate the three parametric matrices WQ, WK, and WV.
$$W_Q = \underset{W_Q}{\mathrm{argmax}}\, L(A), \quad W_K = \underset{W_K}{\mathrm{argmax}}\, L(A), \quad W_V = \underset{W_V}{\mathrm{argmax}}\, L(A)$$
According to the SGD algorithm, parameters WQ, WK, and WV are computed iteratively as follows:
$$W_Q = W_Q + \gamma \frac{\partial L(A)}{\partial W_Q}, \quad W_K = W_K + \gamma \frac{\partial L(A)}{\partial W_K}, \quad W_V = W_V + \gamma \frac{\partial L(A)}{\partial W_V}$$
Note that γ (0 < γ ≤ 1) is the learning rate, whereas $\frac{\partial L(A)}{\partial W_Q}$, $\frac{\partial L(A)}{\partial W_K}$, and $\frac{\partial L(A)}{\partial W_V}$ are the first-order derivatives (gradients) of L(A) with respect to WQ, WK, and WV, respectively. However, it is not easy to calculate these gradients directly based only on L(A).
Recall that SGD is applied to attention learning, but calculating gradients in SGD is seemingly infeasible in practice because the derivative of the matrix-by-matrix soft-max function $\mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)$ is a high-dimensional tensor. Fortunately, $\mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)$ is calculated row by row, so its rows are mutually independent and the likelihood function L(A) can be partitioned into partial likelihood functions li(ai), where every partial attention ai is the column vector representing the ith row of A.
$$l_i(a_i) = \frac{1}{2}\left\|a'_i - a_i\right\|^2 = \frac{1}{2}\sum_j \left(a'_{ij} - a_{ij}\right)^2$$
Consequently, the differential of L(A) is the sum of the partial differentials of li(ai).
$$L(A) = \sum_{i=1}^{m} l_i(a_i)$$
$$dL(A) = \sum_{i=1}^{m} dl_i(a_i)$$
Note,
$$\begin{aligned}
\mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) &= \left(p_1^T, p_2^T, \ldots, p_m^T\right)^T\\
R &= \left(r_1^T, r_2^T, \ldots, r_m^T\right)^T = \frac{QK^T}{\sqrt{d_k}}\\
p_i &= \mathrm{softmax}\left(\frac{q_i^T k_1}{\sqrt{d_k}}, \frac{q_i^T k_2}{\sqrt{d_k}}, \ldots, \frac{q_i^T k_m}{\sqrt{d_k}}\right)^T = \left(p_{i1}, p_{i2}, \ldots, p_{im}\right)^T\\
p_{ij} &= \frac{\exp\left(q_i^T k_j / \sqrt{d_k}\right)}{\sum_{l=1}^{m} \exp\left(q_i^T k_l / \sqrt{d_k}\right)}\\
r_i &= \left(r_{i1}, r_{i2}, \ldots, r_{im}\right)^T = \left(\frac{q_i^T k_1}{\sqrt{d_k}}, \frac{q_i^T k_2}{\sqrt{d_k}}, \ldots, \frac{q_i^T k_m}{\sqrt{d_k}}\right)^T\\
a_i &= \left(a_{i1}, a_{i2}, \ldots, a_{id_v}\right)^T
\end{aligned}$$
Given
$$X = \left(x_1^T, x_2^T, \ldots, x_m^T\right)^T, \quad x_i = \left(x_{i1}, x_{i2}, \ldots, x_{id_m}\right)^T$$
Such that:
$$\mathrm{softmax}(R) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right), \quad p_i = \mathrm{softmax}(r_i)$$
Note that the dimensions of weight matrices WQ, WK, and WV are dm x dk, dm x dk, and dm x dv, respectively. The dimensions of matrices Q, K, and V are m x dk, m x dk, and m x dv, respectively, whereas the dimension of input data X is m x dm. In the literature, dm, which is the most important, is called the model dimension; it establishes the length of the input-output vector. Given the self-attention $a_i = \left(a_{i1}, a_{i2}, \ldots, a_{id_v}\right)^T$ of the ith token, its element $a_{ij} = \sum_{k=1}^{m} p_{ik} v_{kj}$ is the weighted sum of all jth terms (in the same jth column) over all tokens, where the kth weight is the probability pik that measures the similarity of the ith token and the kth token. It is interesting that this probability pik, which is essentially a similarity, is calculated by combining the dot product (non-normalized cosine) with the soft-max function (for normalization); it can be interpreted in another way as the frequency with which the current ith token matches the kth token.
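By partitioning L(A) into many li(ai), it is possible to calculate the gradients $\frac{\partial L(A)}{\partial W_Q}$, $\frac{\partial L(A)}{\partial W_K}$, and $\frac{\partial L(A)}{\partial W_V}$. Indeed, the differential of li(ai) with respect to qi is:
$$\begin{aligned}
dl_i &= \mathrm{tr}\left(\left(a'_i - a_i\right)^T da_i\right) = \mathrm{tr}\left(\left(a'_i - a_i\right)^T d\left(p_i^T V\right)^T\right) = \mathrm{tr}\left(\left(a'_i - a_i\right)^T d\left(V^T p_i\right)\right)\\
&= \mathrm{tr}\left(\left(a'_i - a_i\right)^T V^T p'_i\, dr_i\right) = \mathrm{tr}\left(\left(a'_i - a_i\right)^T V^T p'_i \left(\frac{\partial r_{i1}}{\partial q_i}, \frac{\partial r_{i2}}{\partial q_i}, \ldots, \frac{\partial r_{im}}{\partial q_i}\right)^T dq_i\right)
\end{aligned}$$
where pi’ is the derivative of the soft-max function pi, which is specified later.
$$p'_i = \mathrm{softmax}'(r_i) = \frac{d\,\mathrm{softmax}(r_i)}{dr_i}$$
Due to:
$$\frac{\partial r_{ij}}{\partial q_i} = \frac{\partial \left(q_i^T k_j / \sqrt{d_k}\right)}{\partial q_i} = \frac{1}{\sqrt{d_k}} k_j^T$$
We obtain:
$$dl_i = \frac{1}{\sqrt{d_k}} \mathrm{tr}\left(\left(a'_i - a_i\right)^T V^T p'_i K\, dq_i\right)$$
Due to:
$$q_i^T = x_i^T W_Q$$
The differential of li(ai) with respect to WQ is:
$$\begin{aligned}
dl_i &= \frac{1}{\sqrt{d_k}} \mathrm{tr}\left(\left(a'_i - a_i\right)^T V^T p'_i K\, d\left(x_i^T W_Q\right)^T\right) = \frac{1}{\sqrt{d_k}} \mathrm{tr}\left(\left(a'_i - a_i\right)^T V^T p'_i K\, d\left(W_Q^T\right) x_i\right)\\
&= \frac{1}{\sqrt{d_k}} \mathrm{tr}\left(x_i \left(a'_i - a_i\right)^T V^T p'_i K\, d\left(W_Q^T\right)\right) \quad \text{due to } \mathrm{tr}(BCD) = \mathrm{tr}(CDB) = \mathrm{tr}(DBC)
\end{aligned}$$
Therefore, the gradient of li(ai) with respect to WQ is:
$$\frac{\partial l_i}{\partial W_Q} = \frac{1}{\sqrt{d_k}} x_i \left(a'_i - a_i\right)^T V^T p'_i K$$
Note that every partial real attention ai’ is the column vector representing the ith row of the real attention A’. This means that the gradient of L(A) with respect to WQ is:
$$\frac{\partial L(A)}{\partial W_Q} = \frac{1}{\sqrt{d_k}} \sum_{i=1}^{m} x_i \left(a'_i - a_i\right)^T V^T p'_i K$$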
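where pi’ is the derivative of the soft-max function pi, which is specified later. This is due to:
$$dL(A) = \sum_{i=1}^{m} dl_i(a_i)$$
Similarly, the gradient of L(A) with respect to WK is:
$$\frac{\partial L(A)}{\partial W_K} = \frac{1}{\sqrt{d_k}} \sum_{i=1}^{m} x_i \left(a'_i - a_i\right)^T V^T p'_i Q$$
Recall that WQ and WK are estimated as follows:
$$W_Q = W_Q + \gamma \frac{\partial L(A)}{\partial W_Q}$$
$$W_K = W_K + \gamma \frac{\partial L(A)}{\partial W_K}$$
Consequently, it remains to calculate the gradient of L(A) with respect to WV. Indeed, the differential of L(A) with respect to WV is:
$$\begin{aligned}
dL &= \mathrm{tr}\left(\left(A' - A\right)^T dA\right) = \mathrm{tr}\left(\left(A' - A\right)^T d\left(\mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V\right)\right) = \mathrm{tr}\left(\left(A' - A\right)^T \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) dV\right)\\
&= \mathrm{tr}\left(\left(A' - A\right)^T \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) d\left(XW_V\right)\right) = \mathrm{tr}\left(\left(A' - A\right)^T \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) X\, dW_V\right)
\end{aligned}$$
Therefore, the gradient of L(A) with respect to WV is:
$$\frac{\partial L(A)}{\partial W_V} = X^T \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)^T \left(A' - A\right)$$
Such that WV is estimated as follows:
$$W_V = W_V + \gamma \frac{\partial L(A)}{\partial W_V}$$
Recall that the parametric weight matrices of transformer-based self-attention are estimated by the SGD algorithm as follows:
$$W_Q = W_Q + \gamma \frac{1}{\sqrt{d_k}} \sum_{i=1}^{m} x_i \left(a'_i - a_i\right)^T V^T p'_i K$$
$$W_K = W_K + \gamma \frac{1}{\sqrt{d_k}} \sum_{i=1}^{m} x_i \left(a'_i - a_i\right)^T V^T p'_i Q$$
$$W_V = W_V + \gamma X^T \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)^T \left(A' - A\right)$$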
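The most important step in estimating the parametric matrices WQ and WK is to calculate the derivative pi’ of the probability vector pi of the soft-max function.
$$p'_i = \mathrm{softmax}'(r_i) = \frac{d\,\mathrm{softmax}(r_i)}{dr_i}$$
$$p_i = \mathrm{softmax}(r_i) = \left(p_{i1}, p_{i2}, \ldots, p_{im}\right)^T$$
$$r_i = \left(r_{i1}, r_{i2}, \ldots, r_{im}\right)^T = \left(\frac{q_i^T k_1}{\sqrt{d_k}}, \frac{q_i^T k_2}{\sqrt{d_k}}, \ldots, \frac{q_i^T k_m}{\sqrt{d_k}}\right)^T$$
Indeed, we have:
$$p'_i = \mathrm{softmax}'(r_i) = \begin{pmatrix}
\frac{\partial p_{i1}}{\partial r_{i1}} & \frac{\partial p_{i1}}{\partial r_{i2}} & \cdots & \frac{\partial p_{i1}}{\partial r_{im}}\\
\frac{\partial p_{i2}}{\partial r_{i1}} & \frac{\partial p_{i2}}{\partial r_{i2}} & \cdots & \frac{\partial p_{i2}}{\partial r_{im}}\\
\vdots & \vdots & \ddots & \vdots\\
\frac{\partial p_{im}}{\partial r_{i1}} & \frac{\partial p_{im}}{\partial r_{i2}} & \cdots & \frac{\partial p_{im}}{\partial r_{im}}
\end{pmatrix}$$
where,
$$\frac{\partial p_{ij}}{\partial r_{ik}} = p_{ij}\left(\delta_{jk} - p_{ik}\right), \quad \delta_{jk} = \begin{cases} 1 & \text{if } j = k\\ 0 & \text{if } j \neq k \end{cases}$$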
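The soft-max derivative and the two gradients that use it most directly can be sketched in NumPy as follows. This is an illustration under stated assumptions: all function names and sizes are hypothetical, only the gradients with respect to WQ and WV are covered, and the sign convention of the text is kept, so the gradients point in the direction that reduces the squared distance between A’ and A.

```python
import numpy as np

def softmax_rows(R):
    Z = np.exp(R - R.max(axis=1, keepdims=True))
    return Z / Z.sum(axis=1, keepdims=True)

def forward(X, W_Q, W_K, W_V):
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    P = softmax_rows(Q @ K.T / np.sqrt(W_K.shape[1]))
    return Q, K, V, P, P @ V

def softmax_jacobian(p):
    """Matrix p' with entry (j, k) = p_j * (delta_jk - p_k)."""
    return np.diag(p) - np.outer(p, p)

def grad_W_Q(X, A_real, W_Q, W_K, W_V):
    """(1/sqrt(d_k)) * sum_i x_i (a'_i - a_i)^T V^T p'_i K."""
    Q, K, V, P, A = forward(X, W_Q, W_K, W_V)
    G = np.zeros_like(W_Q)
    for i in range(X.shape[0]):
        row = (A_real[i] - A[i]) @ V.T @ softmax_jacobian(P[i]) @ K   # length d_k
        G += np.outer(X[i], row)                                      # d_m x d_k
    return G / np.sqrt(W_K.shape[1])

def grad_W_V(X, A_real, W_Q, W_K, W_V):
    """X^T softmax(Q K^T / sqrt(d_k))^T (A' - A)."""
    Q, K, V, P, A = forward(X, W_Q, W_K, W_V)
    return X.T @ P.T @ (A_real - A)

rng = np.random.default_rng(3)
m, d_m, d_k, d_v = 4, 5, 3, 2                  # arbitrary example sizes
X = rng.normal(size=(m, d_m))
W_Q = 0.5 * rng.normal(size=(d_m, d_k))
W_K = 0.5 * rng.normal(size=(d_m, d_k))
W_V = 0.5 * rng.normal(size=(d_m, d_v))
A_real = rng.normal(size=(m, d_v))             # stands in for the real attention A'

G_Q = grad_W_Q(X, A_real, W_Q, W_K, W_V)
G_V = grad_W_V(X, A_real, W_Q, W_K, W_V)

gamma = 0.01                                   # learning rate for one SGD step
W_Q_new = W_Q + gamma * G_Q
W_V_new = W_V + gamma * G_V
```

The loop over rows mirrors the partition of L(A) into the partial likelihoods li(ai); each row needs only its own m x m soft-max derivative rather than a four-dimensional tensor.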
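Because transformer is estimated here by the association of the stochastic gradient descent (SGD) algorithm and the backpropagation algorithm, it is necessary to calculate the backward bias (bias, reward, backward reward, backward error, error), which is the gradient of the likelihood L(A) with respect to input data X:
$$b(X) = \frac{dL(X)}{dX}$$
Due to:
$$A = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
Let dL(X) be the differential of the likelihood L(A) with respect to input data X:
$$dL(X) = dL\left(QK^T\right) + dL(V)$$
where, as a convention, dL(QKT) is the differential of the likelihood with respect to X given QKT as intermediate variable and dL(V) is the differential of the likelihood with respect to X given V as intermediate variable. Recall that, by the technique of likelihood partition, the partial differential dli with respect to qi is:
$$dl_i = \mathrm{tr}\left(\left(a'_i - a_i\right)^T V^T p'_i\, dr_i\right) = \frac{1}{\sqrt{d_k}} \mathrm{tr}\left(\left(a'_i - a_i\right)^T V^T p'_i K\, dq_i\right)$$
Similarly, the partial differential dli with respect to ki is:
$$dl_i = \mathrm{tr}\left(\left(a'_i - a_i\right)^T V^T p'_i\, dr_i\right) = \frac{1}{\sqrt{d_k}} \mathrm{tr}\left(\left(a'_i - a_i\right)^T V^T p'_i Q\, dk_i\right)$$
This implies that the partial differential dli with respect to xi, which is the column vector representing the ith row of X, is:
$$\begin{aligned}
dl_i(x_i) &= \frac{1}{\sqrt{d_k}} \mathrm{tr}\left(\left(a'_i - a_i\right)^T V^T p'_i \left(K\, dq_i + Q\, dk_i\right)\right)\\
&= \frac{1}{\sqrt{d_k}} \mathrm{tr}\left(\left(a'_i - a_i\right)^T V^T p'_i \left(K\, d\left(x_i^T W_Q\right)^T + Q\, d\left(x_i^T W_K\right)^T\right)\right)\\
&= \frac{1}{\sqrt{d_k}} \mathrm{tr}\left(\left(a'_i - a_i\right)^T V^T p'_i K W_Q^T\, dx_i\right) + \frac{1}{\sqrt{d_k}} \mathrm{tr}\left(\left(a'_i - a_i\right)^T V^T p'_i Q W_K^T\, dx_i\right)\\
&= \frac{1}{\sqrt{d_k}} \mathrm{tr}\left(\left(a'_i - a_i\right)^T V^T p'_i \left(K W_Q^T + Q W_K^T\right) dx_i\right)
\end{aligned}$$
Obviously, the gradient of likelihood li with respect to xi is:
$$\frac{dl_i}{dx_i} = \frac{1}{\sqrt{d_k}} \left(\left(a'_i - a_i\right)^T V^T p'_i \left(K W_Q^T + Q W_K^T\right)\right)^T$$
Therefore, the gradient of likelihood L(QKT) with respect to X given QKT is determined by the following concatenation:
$$\frac{dL\left(QK^T\right)}{dX} = \left(\left(\frac{dl_1}{dx_1}\right)^T, \left(\frac{dl_2}{dx_2}\right)^T, \ldots, \left(\frac{dl_m}{dx_m}\right)^T\right)^T$$
Recall that the differential of L(A) with respect to V is:
$$dL = \mathrm{tr}\left(\left(A' - A\right)^T dA\right) = \mathrm{tr}\left(\left(A' - A\right)^T d\left(\mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V\right)\right) = \mathrm{tr}\left(\left(A' - A\right)^T \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) d\left(XW_V\right)\right)$$
This implies that the differential of L(.) with respect to X given A is:
$$dL(V) = \mathrm{tr}\left(\left(A' - A\right)^T \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) dX\, W_V\right) = \mathrm{tr}\left(W_V \left(A' - A\right)^T \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) dX\right)$$
Obviously, the gradient of likelihood L(A) with respect to X is:
$$\frac{dL(A)}{dX} = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)^T \left(A' - A\right) W_V^T$$
Due to:
$$dL(X) = dL\left(QK^T\right) + dL(V)$$
The backward bias/error b(X), which is the gradient of the likelihood L(A) with respect to input data X, is determined as the following sum:
$$b(X) = \frac{dL(X)}{dX} = \frac{dL\left(QK^T\right)}{dX} + \frac{dL(A)}{dX} = \begin{pmatrix} \left(\frac{dl_1}{dx_1}\right)^T\\ \left(\frac{dl_2}{dx_2}\right)^T\\ \vdots\\ \left(\frac{dl_m}{dx_m}\right)^T \end{pmatrix} + \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)^T \left(A' - A\right) W_V^T$$
where,
$$\frac{dl_i}{dx_i} = \frac{1}{\sqrt{d_k}} \left(\left(a'_i - a_i\right)^T V^T p'_i \left(K W_Q^T + Q W_K^T\right)\right)^T$$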
Recall that this research focuses on evaluating the hypothesis “whether matrix neural network is the alternative of convolutional neural network”, which means testing the preeminence of matrix neural network (MNN) in comparison with convolutional neural network (CNN). Moreover, the hypothesis in this research relates to transformer too because attention, which is the first part of transformer, is entirely based on matrix, although the second part of transformer, which is feed-forward network (FFN), can be either a matrix artificial neural network called MNN or a vector artificial neural network called vector ANN. As a convention, the transformer whose FFN is MNN is called matrix transformer and the transformer whose FFN is vector ANN is called vector transformer. Also as a convention, vector ANN is called ANN for short if there is no additional explanation, and matrix ANN is MNN. Among the many applications related to CNN and transformer, this research focuses on image classification, which is therefore tested with ANN, MNN, matrix transformer, and vector transformer. Consequently, the main hypothesis “whether MNN is the alternative of CNN” leads to the first auxiliary hypothesis “whether attention is the alternative of CNN too”, in which attention plays the role of a feature like a convolutional layer, although the attention-based feature is actually self-structure or internal structure. Besides, if both MNN and attention are good enough in comparison with CNN, even though neither can totally replace CNN, then FFN of transformer should be MNN because this harmonizes computational cost and accuracy. In other words, the second auxiliary hypothesis “whether FFN of transformer should be MNN” (whether matrix transformer is preferred to vector transformer) is tested.
An auxiliary problem that needs to be solved in this research relates to the convergence of ANN: keeping the accuracy stable when accuracy in image classification decreases due to ANN convergence. Indeed, although high convergence is a significant strong point of ANN, it conversely becomes a drawback because the image input of an ANN classifier is very high-dimensional whereas its class output is very low-dimensional, which causes the side effect that the ANN classifier focuses on (converges to) very few classes. To overcome this drawback, this research proposes the concept of a baseline, where baseline b consists of n basepoints corresponding to n classes. Each basepoint bi is the average probability that images belong to class i.
$$b_i = \frac{1}{N_i} \sum_{k=1}^{N_i} p_i^{(k)}$$
where $p_i^{(k)}$ is the output probability on class i of the classified image k, given that image k really belongs to class i, and Ni is the total number of images that really belong to class i in the training dataset. Note,
$$b = \left(b_1, b_2, \ldots, b_n\right)^T$$
Let p = (p1, p2,…, pn)T be the output probability vector of some image on n classes, which is the output of the ANN classifier. Let d be the probability deviation between baseline b and output p.
$$d = p - b = \left(d_1 = p_1 - b_1, d_2 = p_2 - b_2, \ldots, d_n = p_n - b_n\right)^T$$
Such an image is classified as class k if dk is the largest or dk is larger than some threshold. In general, the ANN-based image classification in this research is based on the level above the baseline (not above zero). In advanced research, the probability deviation d is improved by a so-called adjusted-line. For instance, the adjusted-line a = (a1, a2,…, an)T consists of n adjusted-points corresponding to n classes. Each adjusted-point ai is the average output (average probability) corresponding to class i.
$$a_i = \frac{1}{N} \sum_{k=1}^{N} o_i^{(k)}$$
where $o_i^{(k)}$ is the output of image k corresponding to class i and N is the total number of images in the training dataset. Please note that $o_i^{(k)}$ is just the ith output of the ANN classifier whose input is image k, and such image k may not belong to class i. The advanced probability deviation d* is the element-wise multiplication of the probability deviation d and the soft-max function of the adjusted-line a.
$$d^* = \left(d_1^*, d_2^*, \ldots, d_n^*\right)^T = d \odot \mathrm{softmax}(a)$$
Such an image is classified as class k if dk* is the largest or dk* is larger than some threshold. It is easy to recognize that the probability deviation d is weighted by the concentrations (weights) on the n classes, which reflects the convergence of learning ANN. The experimental design for hypothesis testing on image classification with ANN classifiers in both matrix form and vector form is described in the next section. The implementation source code of matrix neural network, transformer, and convolutional network is available at
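The baseline and adjusted-line rules can be sketched in NumPy as follows. This is only an illustration: the function names and the tiny training arrays are made up for the example, the largest-deviation rule is used instead of a threshold, and every class is assumed to have at least one training image.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()

def baseline(train_probs, train_labels, n_classes):
    """b_i: mean output probability on class i over images that really belong to class i."""
    return np.array([train_probs[train_labels == i, i].mean() for i in range(n_classes)])

def adjusted_line(train_outputs):
    """a_i: mean of the ith output over all training images."""
    return train_outputs.mean(axis=0)

def classify(p, b, a=None):
    """Class with the largest deviation d = p - b, optionally weighted by softmax(a)."""
    d = p - b
    if a is not None:
        d = d * softmax(a)         # element-wise product giving d*
    return int(np.argmax(d))

# Made-up outputs of a classifier that concentrates on class 0.
train_probs = np.array([
    [0.8, 0.1, 0.1],
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.8, 0.1, 0.1],
])
train_labels = np.array([0, 0, 1, 2])

b = baseline(train_probs, train_labels, n_classes=3)
a = adjusted_line(train_probs)

p = np.array([0.70, 0.28, 0.02])   # output for a new image
plain_class = int(np.argmax(p))    # decision measured from zero
baseline_class = classify(p, b)    # decision measured from the baseline
adjusted_class = classify(p, b, a) # decision with the adjusted-line weighting
```

In this made-up case the plain decision picks the dominant class, while the deviation from the baseline picks the class whose probability is unusually high relative to its own basepoint.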

4. Experimental Design

The experiments in this research focus on image classification by matrix neural network (MNN), called MNN classifier, by convolutional neural network (CNN), called CNN classifier, and by transformer, called transformer-based classifier (TB classifier). The main purpose of this experimental design is to evaluate the preeminence of matrix neural network through comparisons among MNN classifier, CNN classifier, and TB classifier. The evaluation considers diversity metrics and convergence speed in addition to the accuracy metric, because many excellent studies already achieve very high accuracy in image classification through specific configurations of convolutional layers. In the testing dataset, MNN classifier, TB classifier, and CNN classifier are represented by the “feature” field, whose values are “weight”, “trans”, and “conv”, respectively. That is, the weight feature represents MNN classifier, the trans feature represents TB classifier, and the conv feature represents CNN classifier. The characteristic aspect of traditional artificial neural network is that its layers are vectors; in other words, its aspect is vectorization. Therefore, the testing dataset has a second field called “vec”, whose values true and false indicate whether or not MNN classifier, TB classifier, and CNN classifier are vectorized. When a classifier is vectorized, only its feedforward network (FFN) is vectorized because each classifier always has two parts: 1) the first part, for feature extraction, is always MNN, attention, or convolutional layers, and 2) the second part, for classification, is always FFN. Note that the first parts of MNN classifier, TB classifier, and CNN classifier are MNN, attentional set (including only attentions), and convolutional set (including only convolutional layers), respectively. The experimental design also supports two-way analysis of variance (two-way ANOVA) whose groups are the feature field and the vec field.
Ultimately, the two-way ANOVA aims to test the three hypotheses: 1) whether MNN is the alternative of CNN, which focuses on testing the preeminence of MNN classifier in comparison with TB classifier and CNN classifier via testing the feature group, 2) whether attention is the alternative of CNN, which focuses on testing the preeminence of TB classifier in comparison with MNN classifier and CNN classifier via testing the feature group, and 3) whether FFN of transformer should be MNN via testing the vec group.
Recall that the three classifiers, namely MNN classifier, TB classifier, and CNN classifier, always have two parts: the first part for image feature extraction and the second part for feature classification. Therefore, MNN classifier has one MNN and one FFN and TB classifier has one attentional set and one FFN, whereas CNN classifier has one convolutional set and one FFN. Note that the attentional set is the set of sequential attentions and the convolutional set is the network of convolutional layers. The depths of MNN, attentional set, and convolutional set exclude their input layers. The depth of the attentional set is the number of its attentions and the depth of the convolutional set is the number of its convolutional layers. MNN, attentional set, and convolutional set follow the VGG architecture. As a convention, an attention in the attentional set is called an attentional layer so that all elements of MNN, attentional set, and convolutional set are layers. Moreover, the attentional set and the convolutional set are called attentional network and convolutional network, as a convention. Therefore, MNN, attentional network, and convolutional network are indicated by the common term feature extraction network (FEN), so that FFN is called feature classification network (FCN). To distinguish them, MNN is called matrix FEN whereas attentional network is called attentional FEN and convolutional network is called convolutional FEN. A FEN has some blocks and each block is a set of sequential layers. A (FEN) block of matrix FEN, attentional FEN, or convolutional FEN is called matrix block, attentional block, or convolutional block, respectively. A (FEN) layer of matrix block, attentional block, or convolutional block is called matrix layer, attentional layer, or convolutional layer, respectively. A layer of FCN is called FCN layer, which is always a traditional neural network layer.
Every FEN layer is always associated with a so-called FEN filter, but please distinguish the FEN filter from the aforementioned filtering kernel. The FEN filter of a matrix layer is the parametric weight matrix as usual, which is called matrix filter. The FEN filter of an attentional layer is the attention itself, which is called attentional filter. The FEN filter of a convolutional layer is the filtering kernel as usual, which is called convolutional filter. However, please note that the last layer of any FEN block is the specific layer called max-pooling layer, whose filter is the aforementioned max-pooling kernel, which is called max-pooling filter. In other words, a max-pooling layer is always associated with a max-pooling filter. Matrix filter, attentional filter, and convolutional filter can be called matrix weight, attentional weight, and convolutional weight, respectively, because both weight matrix and filtering kernel are matrices, except that the weight matrix is applied to the entire layer whereas the filtering kernel is applied to every small part (window) of the layer.
The experimental design has two important configuration variables with respect to FEN: the experimental variable “base” and the experimental variable “layer”. The current setting of the base variable is 2 and there are two values of the layer variable, 2 and 3. The base variable (base=2) is the divisor of block size, which implies that all classifiers (MNN, TB, CNN) in the experimental design have two FEN blocks. The layer variable (2, 3) is the number of FEN layers in each block. Precisely, given layer=2, the first FEN block has two FEN layers and one max-pooling layer and the second FEN block has two FEN layers, so that FEN has five layers. Similarly, given layer=3, the first FEN block has three FEN layers and one max-pooling layer and the second FEN block has three FEN layers, so that FEN has seven layers. Suppose layer size follows the three-dimensional form width x height x depth; all layers in the same block have the same width x height size, and this width x height size is divided by base from one block to the next. Therefore, the width x height size of all layers in the same block is considered the width x height block size, or block size. For example, if the first block has size 32 x 32, then the second block has size 16 x 16. The depth of all layers in the same block is considered the block depth. Block depth is also the count of filters (filter count) in every layer of the same block. Block depth increases with block ordering as an exponential function of the layer variable.
For instance, given layer=2 with respect to FEN, the first layer has 2 = 2^1 filters, the second layer has 2 = 2^1 filters, the third layer is the max-pooling layer, the fourth layer has 4 = 2^2 filters, and the fifth layer has 4 = 2^2 filters, with note that the first, second, and third layers belong to the first block whereas the fourth and fifth layers belong to the second block. Given layer=3 with respect to FEN, the first layer has 3 = 3^1 filters, the second layer has 3 = 3^1 filters, the third layer has 3 = 3^1 filters, the fourth layer is the max-pooling layer, the fifth layer has 9 = 3^2 filters, the sixth layer has 9 = 3^2 filters, and the seventh layer has 9 = 3^2 filters, with note that the first, second, third, and fourth layers belong to the first block whereas the fifth, sixth, and seventh layers belong to the second block. The following table summarizes the configuration of FEN for every classifier (MNN, TB, CNN) given layer=2.
[Table image: configuration of the two FEN blocks given layer=2]
Following table summarizes the configuration of FEN for every classifier (MNN, TB, CNN) given layer=3.
Table 4.2. Two FEN blocks given layer variable is 3.
| Block | Layer | Input dimension | Filter dimension | Filter count | Output dimension |
|---|---|---|---|---|---|
| 1 | layer1_1 | 32 x 32 x 3 | 3 x 3 x 3 | 3 | 32 x 32 x 3 |
| | layer1_2 | 32 x 32 x 3 | 3 x 3 x 3 | 3 | 32 x 32 x 3 |
| | layer1_3 | 32 x 32 x 3 | 3 x 3 x 3 | 3 | 32 x 32 x 3 |
| | maxpool1 | 32 x 32 x 3 | 2 x 2 | | 16 x 16 x 3 |
| 2 | layer2_1 | 16 x 16 x 3 | 3 x 3 x 3 | 9 | 16 x 16 x 9 |
| | layer2_2 | 16 x 16 x 9 | 3 x 3 x 9 | 9 | 16 x 16 x 9 |
| | layer2_3 | 16 x 16 x 9 | 3 x 3 x 9 | 9 | 16 x 16 x 9 |
The experimental design establishes FCN to have always two layers where the last layer (output layer) is the set of image classes. Recall that whether or not FCN is vectorized depends on “vec” field in dataset (vec group in ANOVA). Finally, each classifier (MNN, TB, CNN) including FEN and FCN has 7 layers given layer=2 and has 9 layers given layer=3.
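The FEN configuration rules can be written as a short Python sketch. It is a hedged reading of the description and of Table 4.2: the function name `fen_layout`, its parameters `input_size` and `n_blocks`, and the row format are assumptions for illustration, and a max-pooling layer is placed only between consecutive blocks, as the layer listing and the table show.

```python
def fen_layout(layer, base=2, input_size=32, n_blocks=2):
    """List FEN layers as (block, name, input width/height, filter count).

    Hypothetical helper: filter count is layer ** block index, and the
    width/height is divided by `base` after each max-pooling layer.
    """
    rows = []
    size = input_size
    for block in range(1, n_blocks + 1):
        filters = layer ** block                   # block depth grows exponentially
        for j in range(1, layer + 1):
            rows.append((block, f"layer{block}_{j}", size, filters))
        if block < n_blocks:                       # max-pooling closes all but the last block
            rows.append((block, f"maxpool{block}", size, None))
            size //= base                          # next block works on the reduced size
    return rows

layout2 = fen_layout(layer=2)
layout3 = fen_layout(layer=3)

# Total classifier depth adds the two FCN layers to the FEN layers.
total2 = len(layout2) + 2
total3 = len(layout3) + 2

for row in layout3:
    print(row)
```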
Recall that the experimental design focuses on evaluating the preeminence of matrix neural network when comparing image classifiers, so the accuracy metric is not so essential in this research, given that other deep image classification studies achieve very high accuracy with specific sets of convolutional layers. Nevertheless, the accuracy metric, which is the ratio of the number of correctly classified images to the total number of images, is still measured in this research, as follows:
$$accuracy = \frac{\text{the number of correctly classified images}}{\text{the total number of images}}$$
Because accuracy metric in this research is not larger than good enough level 0.6 (60%), it is necessary to survey the diversity in classification of the aforementioned classifiers when accuracy metric does not reach 0.11 (11%) with note that suppose a classifier does nothing except fixing one resulted class randomized among 10 classes, then such classifier still reaches accuracy 0.1 (10%) because the experimental dataset is CIFAR-10 dataset including 5 training folders and 1 testing folder where each folder has 10000 images grouped in n=10 classes and every class has equal number of 1000 images, which means that the probability that an image belongs to a certain class is always 0.1 (10%) when every class in testing dataset containing 10000 images has always 1000 images. Therefore, given n=10 classes, the experimental design proposes n=10 pairs of precision metrics and recall metrics. For instance, given the ith class, the ith precision is the ratio of the number of correctly classified images in the ith class to the number of classified images in the ith class whereas the ith recall is the ratio of the number of classified images in the ith class to the number of images actually belong to the ith class. The ith precision of ith class is specified as follows:
$$\mathrm{precision}_i = \frac{\text{the number of correctly classified images in the } i\text{th class}}{\text{the number of classified images in the } i\text{th class}}$$
The ith recall of ith class is specified as follows:
$$\mathrm{recall}_i = \frac{\text{the number of classified images in the } i\text{th class}}{\text{the number of images belonging to the } i\text{th class}}$$
The ith F1 metric which harmonizes ith precision and ith recall is specified as follows:
$$F1_i = \frac{2\,\mathrm{precision}_i\,\mathrm{recall}_i}{\mathrm{precision}_i + \mathrm{recall}_i}$$
The overall precision, recall, and F1 over n=10 classes are the means of the per-class precisions, recalls, and F1 values, respectively. These metrics concern the classes and the diversity across classes.
$$\mathrm{precision} = \frac{1}{n}\sum_{i=1}^{n}\mathrm{precision}_i, \quad \mathrm{recall} = \frac{1}{n}\sum_{i=1}^{n}\mathrm{recall}_i, \quad F1 = \frac{1}{n}\sum_{i=1}^{n}F1_i, \quad \text{where } n = 10$$
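As a minimal sketch, the formulas above can be computed from per-class counts as follows. The function and variable names are illustrative assumptions, not the experiment's code, and giving a value of 0 to any ratio with an empty denominator is also an assumption. Note that recall_i is coded literally as written above, with all images classified into the ith class in the numerator; the more common definition of recall places only the correctly classified images there, so the experiment's own implementation may differ.

```python
def per_class_counts(y_true, y_pred, n_classes):
    """Count, per class: correct predictions, assigned images, belonging images."""
    correct = [0] * n_classes
    assigned = [0] * n_classes
    belonging = [0] * n_classes
    for t, p in zip(y_true, y_pred):
        belonging[t] += 1
        assigned[p] += 1
        if t == p:
            correct[t] += 1
    return correct, assigned, belonging


def safe_div(num, den):
    # Assumption: a ratio with an empty denominator counts as 0.
    return num / den if den else 0.0


def macro_metrics(y_true, y_pred, n_classes):
    """Accuracy plus the class-averaged precision, recall, and F1 of the text."""
    correct, assigned, belonging = per_class_counts(y_true, y_pred, n_classes)
    precision = [safe_div(c, a) for c, a in zip(correct, assigned)]
    # Literal reading of the text: images assigned to class i over images in class i.
    recall = [safe_div(a, b) for a, b in zip(assigned, belonging)]
    f1 = [safe_div(2 * p * r, p + r) for p, r in zip(precision, recall)]
    return {
        "accuracy": safe_div(sum(correct), len(y_true)),
        "precision": sum(precision) / n_classes,
        "recall": sum(recall) / n_classes,
        "F1": sum(f1) / n_classes,
    }
```

Calling `macro_metrics(y_true, y_pred, 10)` with two equally long lists of class indices in the range 0 to 9 returns the four main testing metrics in one dictionary.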
The meanings of the precision metric and the accuracy metric are very similar, but they differ subtly. At the same level of accuracy, the precision metric will be higher than the accuracy metric if the classifier obtains good enough diversity. For instance, the do-nothing classifier, which always outputs one fixed class chosen at random among n=10 classes, always obtains accuracy 0.1 (10%), but its precision metric is always smaller than that of a classifier that reaches the same accuracy 0.1 but covers more than one class. Although classification diversity is hidden within the precision, recall, and F1 metrics, the so-called class coverage is proposed to make classification diversity visible. Class coverage, denoted cc, is the ratio of the number of classified classes to the total number of classes, where a classified class is a class whose corresponding recall is positive. In other words, if the classifier assigns any image to a class, that class becomes a classified class.
$$cc = \frac{\text{the number of classified classes}}{\text{the total number of classes}}$$
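The contrast between the do-nothing classifier and a more diverse classifier with the same accuracy can be illustrated with a small sketch. The synthetic labels below are an assumption made only for illustration (a balanced test set of 10 classes with 1000 images each, and a hypothetical "spread" classifier that cycles through the classes); they are not experimental data, and a class that receives no image is assumed to contribute precision 0.

```python
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_precision(y_true, y_pred, n_classes):
    correct = [0] * n_classes
    assigned = [0] * n_classes
    for t, p in zip(y_true, y_pred):
        assigned[p] += 1
        if t == p:
            correct[p] += 1
    # Assumption: a class that receives no image contributes precision 0.
    per_class = [c / a if a else 0.0 for c, a in zip(correct, assigned)]
    return sum(per_class) / n_classes


def class_coverage(y_pred, n_classes):
    """Ratio of classes that receive at least one image to all classes."""
    return len(set(y_pred)) / n_classes


N_CLASSES, PER_CLASS = 10, 1000
y_true = [c for c in range(N_CLASSES) for _ in range(PER_CLASS)]

fixed = [3] * len(y_true)                             # do-nothing classifier
spread = [j % N_CLASSES for j in range(len(y_true))]  # hypothetical diverse classifier

for name, y_pred in (("fixed", fixed), ("spread", spread)):
    print(name,
          accuracy(y_true, y_pred),
          macro_precision(y_true, y_pred, N_CLASSES),
          class_coverage(y_pred, N_CLASSES))
```

Both classifiers reach accuracy 0.1, but the fixed one has precision of about 0.01 with cc = 0.1, whereas the spread one has precision of about 0.1 with cc = 1.0.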
In general, there are 4 main testing metrics (accuracy, precision, recall, and F1) and 1 additional metric (class coverage), in which F1 is the final metric that evaluates the preeminence of every classifier. The second most important metric is accuracy, because a deep neural network tends toward convergence so that it often focuses on a few classes among the n=10 classes, which decreases diversity but increases accuracy. Computational convergence is a strong point of the artificial neural network (ANN), but it unfortunately becomes a drawback in image classification because the image input is high-dimensional data whereas the class output is low-dimensional data. If the class output is as large as the corpus dictionary in natural language processing (NLP), this convergence aspect of ANN becomes an even stronger point of a deep neural network classifier. Other studies that focus on highly accurate image classification try to increase the accuracy metric while keeping diversity stable by configuring sufficiently many filters and convolutional layers.

5. Results and Discussions

The experimental dataset is CIFAR-10, which includes 5 training folders and 1 testing folder; each folder has 10000 images grouped in 10 classes, and the testing folder contains exactly 1000 images of every class. Given that the experimental variable “layer” is 2, CIFAR-10 is tested 10 times for each of the 3 values (“weight”, “trans”, “conv”) of the feature group and the 2 values (“true”, “false”) of the vec group, which yields 60 records of the ANOVA dataset. Its groups are “vec” and “feature”, and its treatments are “accuracy”, “precision”, “recall”, and “F1”, corresponding to the accuracy, precision, recall, and F1 metrics described above. Given that “layer” is 3, CIFAR-10 is tested again 10 times for the same 3 values of the feature group and 2 values of the vec group, which yields another 60 records. The following table illustrates the 120 records of the full ANOVA dataset.
Table 5.1. Illustration of ANOVA dataset.
| No. | layer | vec | feature | accuracy | precision | recall | F1 | cc |
|---|---|---|---|---|---|---|---|---|
| 1 | 2 | false | weight | 0.0826 | 0.0980 | 0.7623 | 0.1678 | 1.0000 |
| 2 | 2 | false | trans | 0.0727 | 0.0841 | 0.7903 | 0.1350 | 1.0000 |
| 3 | 2 | false | conv | 0.1000 | 0.0400 | 0.4000 | 0.0727 | 0.4000 |
| 4 | 2 | true | weight | 0.0831 | 0.0872 | 0.7244 | 0.1524 | 1.0000 |
| 5 | 2 | true | trans | 0.0995 | 0.1028 | 0.7543 | 0.1745 | 1.0000 |
| 6 | 2 | true | conv | 0.0995 | 0.0896 | 0.4219 | 0.1090 | 0.9000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 55 | 2 | false | weight | 0.0782 | 0.0844 | 0.7532 | 0.1480 | 1.0000 |
| 56 | 2 | false | trans | 0.0869 | 0.0861 | 0.8044 | 0.1502 | 1.0000 |
| 57 | 2 | false | conv | 0.1018 | 0.1484 | 0.3899 | 0.1212 | 1.0000 |
| 58 | 2 | true | weight | 0.1012 | 0.1078 | 0.8043 | 0.1843 | 1.0000 |
| 59 | 2 | true | trans | 0.0866 | 0.0995 | 0.7281 | 0.1537 | 1.0000 |
| 60 | 2 | true | conv | 0.0828 | 0.0838 | 0.7532 | 0.1476 | 1.0000 |
| 61 | 3 | false | weight | 0.1010 | 0.1110 | 0.8207 | 0.1866 | 1.0000 |
| 62 | 3 | false | trans | 0.0783 | 0.0815 | 0.8688 | 0.1450 | 1.0000 |
| 63 | 3 | false | conv | 0.1000 | 0.0200 | 0.2000 | 0.0364 | 0.2000 |
| 64 | 3 | true | weight | 0.0794 | 0.0833 | 0.7425 | 0.1466 | 1.0000 |
| 65 | 3 | true | trans | 0.0800 | 0.0850 | 0.7450 | 0.1443 | 1.0000 |
| 66 | 3 | true | conv | 0.1000 | 0.0300 | 0.3000 | 0.0545 | 0.3000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 115 | 3 | false | weight | 0.0712 | 0.0701 | 0.7380 | 0.1177 | 1.0000 |
| 116 | 3 | false | trans | 0.0756 | 0.0872 | 0.7852 | 0.1446 | 1.0000 |
| 117 | 3 | false | conv | 0.1000 | 0.0300 | 0.3000 | 0.0545 | 0.3000 |
| 118 | 3 | true | weight | 0.0814 | 0.0889 | 0.8168 | 0.1480 | 1.0000 |
| 119 | 3 | true | trans | 0.0858 | 0.0838 | 0.8438 | 0.1479 | 1.0000 |
| 120 | 3 | true | conv | 0.0969 | 0.0924 | 0.5054 | 0.1203 | 1.0000 |
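The layout of the ANOVA dataset (2 layer settings, 10 repetitions, 2 vec values, and 3 feature values, giving 120 records) can be enumerated with a short sketch. The record structure and names below are hypothetical; the ordering is chosen to reproduce the row order visible in Table 5.1, and the metric columns are deliberately left out because they come from the experiment itself.

```python
from itertools import product

LAYERS = (2, 3)
VEC = ("false", "true")
FEATURES = ("weight", "trans", "conv")
REPEATS = 10


def design_records():
    """Enumerate one record stub per test run: 2 x 10 x 2 x 3 = 120 rows."""
    records = []
    index = 1
    for layer in LAYERS:
        for _ in range(REPEATS):
            for vec, feature in product(VEC, FEATURES):
                records.append(
                    {"index": index, "layer": layer, "vec": vec, "feature": feature}
                )
                index += 1
    return records


records = design_records()
print(len(records), records[0], records[-1])
```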
The ANOVA dataset is available at
Recall that the two-way ANOVA aims to test the three hypotheses on the ANOVA dataset of size 120: 1) whether MNN is the alternative of CNN, which tests the preeminence of the MNN classifier in comparison with the TB classifier and the CNN classifier via the feature group, 2) whether attention is the alternative of CNN, which tests the preeminence of the TB classifier in comparison with the MNN classifier and the CNN classifier via the feature group, and 3) whether the FFN of the transformer should be MNN, which is tested via the vec group. The research uses Gemini 2025 to perform the ANOVA tests. Moreover, only the baseline is applied to improve the classification task; the adjusted line is not applied.
Firstly, the two-way ANOVA test examines groups “vec” and “feature” against treatments “accuracy” and “F1”, in which the accuracy metric measures the overall accuracy of classifiers over the dataset whereas the F1 metric is a compromise between diversity and accuracy. The following table shows the ANOVA results for the accuracy metric, which indicate how groups “vec” and “feature” affect treatment “accuracy” (Gemini 2025).
Table 5.2. ANOVA results for accuracy metric.
| Source | SS | DF | F-statistic | P-value | Significant |
|---|---|---|---|---|---|
| vec | 6.31e-07 | 1 | 0.0121 | 0.9125 | No |
| feature | 0.00677 | 2 | 65.0845 | < 0.0001 | Yes |
| vec:feature | 6.87e-05 | 2 | 0.6604 | 0.5186 | No |
| Residual | 0.00593 | 114 | | | |
Note that SS abbreviates sum of squares and DF abbreviates degrees of freedom, whereas the source “vec:feature” represents the interaction effect between group “vec” and group “feature”. From the result table above, group “feature” has a highly significant effect on treatment “accuracy”, unlike group “vec”: the F-statistic of group “feature” (65.0845) is far larger than that of group “vec” (0.0121), and its P-value is very small (< 0.0001). Moreover, there is no significant interaction between group “vec” and group “feature” because the F-statistic of the interaction effect “vec:feature” (0.6604) is small and its P-value (0.5186) is much larger than 0.05.
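For readers who want to see where the SS, DF, and F-statistic columns come from, the following is a minimal pure-Python sketch of a balanced two-way ANOVA with interaction. It is not the procedure run through Gemini in this research; the function name and the input layout (a dictionary from a pair of factor levels to a list of replicates) are assumptions, and P-values are omitted because they require the F distribution. With 2 x 3 cells and 20 records per cell, the residual DF is 6 x 19 = 114, as in the table above.

```python
def two_way_anova(cells):
    """Balanced two-way ANOVA with interaction.

    cells maps (a_level, b_level) -> list of replicate observations.
    Every combination of levels must be present with the same number of
    replicates (at least 2). Returns {source: (SS, DF, F)}; F is None for
    the residual row.
    """
    a_levels = sorted({a for a, _ in cells})
    b_levels = sorted({b for _, b in cells})
    r = len(next(iter(cells.values())))
    if any(len(v) != r for v in cells.values()):
        raise ValueError("every cell needs the same number of replicates")

    def mean(xs):
        return sum(xs) / len(xs)

    grand = mean([x for v in cells.values() for x in v])
    a_mean = {a: mean([x for b in b_levels for x in cells[(a, b)]]) for a in a_levels}
    b_mean = {b: mean([x for a in a_levels for x in cells[(a, b)]]) for b in b_levels}
    cell_mean = {k: mean(v) for k, v in cells.items()}

    ss = {
        "A": r * len(b_levels) * sum((a_mean[a] - grand) ** 2 for a in a_levels),
        "B": r * len(a_levels) * sum((b_mean[b] - grand) ** 2 for b in b_levels),
        "A:B": r * sum(
            (cell_mean[(a, b)] - a_mean[a] - b_mean[b] + grand) ** 2
            for a in a_levels for b in b_levels
        ),
    }
    ss_res = sum((x - cell_mean[k]) ** 2 for k, v in cells.items() for x in v)

    df = {"A": len(a_levels) - 1, "B": len(b_levels) - 1}
    df["A:B"] = df["A"] * df["B"]
    df_res = len(a_levels) * len(b_levels) * (r - 1)

    ms_res = ss_res / df_res  # mean square of the residual
    table = {src: (ss[src], df[src], (ss[src] / df[src]) / ms_res) for src in ss}
    table["Residual"] = (ss_res, df_res, None)
    return table
```

Here factor A would play the role of group “vec” and factor B the role of group “feature”, with the 20 metric values of each (vec, feature) combination as the replicates.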
The following table shows the ANOVA results for the F1 metric, which indicate how groups “vec” and “feature” affect treatment “F1” (Gemini 2025).
Table 5.3. ANOVA results for F1 metric.
| Source | SS | DF | F-statistic | P-value | Significant |
|---|---|---|---|---|---|
| vec | 0.00074 | 1 | 1.7781 | 0.1850 | No |
| feature | 0.09846 | 2 | 118.3605 | < 0.0001 | Yes |
| vec:feature | 0.00139 | 2 | 1.6758 | 0.1917 | No |
| Residual | 0.04741 | 114 | | | |
From the result table above, group “feature” has a highly significant effect on treatment “F1”, unlike group “vec”: the F-statistic of group “feature” (118.3605) is far larger than that of group “vec” (1.7781), and its P-value is very small (< 0.0001). Moreover, there is no significant interaction between group “vec” and group “feature” because the F-statistic of the interaction effect “vec:feature” (1.6758) is small and its P-value (0.1917) is much larger than 0.05.
The two-way ANOVA test then examines groups “vec” and “feature” against treatments “precision” and “recall”, in which the precision metric leans toward accuracy and the recall metric leans toward diversity; note that F1 is the compromise between precision and recall. The following table shows the ANOVA results for the precision metric, which indicate how groups “vec” and “feature” affect treatment “precision” (Gemini 2025).
Table 5.4. ANOVA results for precision metric.
| Source | SS | DF | F-statistic | P-value | Significant |
|---|---|---|---|---|---|
| vec | 4.8e-05 | 1 | 0.1180 | 0.7318 | No |
| feature | 0.01908 | 2 | 23.3842 | < 0.0001 | Yes |
| vec:feature | 0.00024 | 2 | 0.2909 | 0.7481 | No |
| Residual | 0.0465 | 114 | | | |
From the result table above, group “feature” has a highly significant effect on treatment “precision”, unlike group “vec”: the F-statistic of group “feature” (23.3842) is far larger than that of group “vec” (0.1180), and its P-value is very small (< 0.0001). Moreover, there is no significant interaction between group “vec” and group “feature” because the F-statistic of the interaction effect “vec:feature” (0.2909) is small and its P-value (0.7481) is much larger than 0.05.
The following table shows the ANOVA results for the recall metric, which indicate how groups “vec” and “feature” affect treatment “recall” (Gemini 2025).
Table 5.5. ANOVA results for recall metric.
| Source | SS | DF | F-statistic | P-value | Significant |
|---|---|---|---|---|---|
| vec | 0.00809 | 1 | 1.5663 | 0.2133 | No |
| feature | 3.25747 | 2 | 315.5278 | < 0.0001 | Yes |
| vec:feature | 0.01207 | 2 | 1.1690 | 0.3144 | No |
| Residual | 0.58846 | 114 | | | |
From the result table above, group “feature” has a highly significant effect on treatment “recall”, unlike group “vec”: the F-statistic of group “feature” (315.5278) is far larger than that of group “vec” (1.5663), and its P-value is very small (< 0.0001). Moreover, there is no significant interaction between group “vec” and group “feature” because the F-statistic of the interaction effect “vec:feature” (1.1690) is small and its P-value (0.3144) is much larger than 0.05.
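Each F-statistic in the four ANOVA tables above is the mean square of its source divided by the mean square of the residual, F = (SS / DF) / (SS_residual / DF_residual). The following sketch recomputes the F column from the printed SS and DF columns as a consistency check; the dictionary layout and function names are illustrative assumptions. Because the printed SS values are rounded, the recomputed values agree with the reported ones only to within roughly 1 percent.

```python
# (SS, DF, reported F) copied from Tables 5.2 to 5.5; None marks the residual row.
TABLES = {
    "accuracy": {"vec": (6.31e-07, 1, 0.0121), "feature": (0.00677, 2, 65.0845),
                 "vec:feature": (6.87e-05, 2, 0.6604), "Residual": (0.00593, 114, None)},
    "F1": {"vec": (0.00074, 1, 1.7781), "feature": (0.09846, 2, 118.3605),
           "vec:feature": (0.00139, 2, 1.6758), "Residual": (0.04741, 114, None)},
    "precision": {"vec": (4.8e-05, 1, 0.1180), "feature": (0.01908, 2, 23.3842),
                  "vec:feature": (0.00024, 2, 0.2909), "Residual": (0.0465, 114, None)},
    "recall": {"vec": (0.00809, 1, 1.5663), "feature": (3.25747, 2, 315.5278),
               "vec:feature": (0.01207, 2, 1.1690), "Residual": (0.58846, 114, None)},
}


def f_statistics(table):
    """Recompute F = (SS / DF) / (SS_residual / DF_residual) for each source."""
    ss_res, df_res, _ = table["Residual"]
    ms_res = ss_res / df_res
    return {src: (ss / df) / ms_res
            for src, (ss, df, _) in table.items() if src != "Residual"}


for metric, table in TABLES.items():
    for src, f in f_statistics(table).items():
        print(f"{metric:9s} {src:12s} recomputed F = {f:9.4f}  reported F = {table[src][2]}")
```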
In general, across the metrics accuracy, F1, precision, and recall, group “vec” has no significant effect (all P-values are larger than 0.05). For the third hypothesis “FFN of transformer should be MNN”, which asks whether vector FEC is better than matrix FEC, it is therefore possible to conclude that the FFN/FEC of the MNN classifier and the TB classifier should be a matrix: vectorizing FEC does not yield significantly better classification, while vectorized FEC (vector FEC) consumes much more computational resources. More precisely, the ANOVA does not reject the null hypothesis that vectorization makes no difference, and this absence of a significant difference is the evidence that supports the third hypothesis.
Although the experiment focuses on testing the three hypotheses by ANOVA tests of significant differences, it is also necessary to summarize the means of metrics “accuracy”, “F1”, “precision”, and “recall” with respect to groups “vec” and “feature” (Gemini 2025).
Table 5.6. Means of metrics “accuracy”, “F1”, “precision”, and “recall”.
| vec | feature | Accuracy | F1 | Precision | Recall |
|---|---|---|---|---|---|
| false | conv | 0.0991 | 0.0845 | 0.0614 | 0.4154 |
| false | trans | 0.0834 | 0.1508 | 0.0881 | 0.7751 |
| false | weight | 0.0816 | 0.1530 | 0.0887 | 0.7963 |
| true | conv | 0.0980 | 0.0983 | 0.0631 | 0.4601 |
| true | trans | 0.0857 | 0.1547 | 0.0926 | 0.7790 |
| true | weight | 0.0809 | 0.1502 | 0.0863 | 0.7970 |
| (all) | conv | 0.0986 | 0.0914 | 0.0623 | 0.4377 |
| (all) | trans | 0.0846 | 0.1527 | 0.0903 | 0.7771 |
| (all) | weight | 0.0812 | 0.1516 | 0.0875 | 0.7966 |
| false | (all) | 0.0880 | 0.1294 | 0.0794 | 0.6623 |
| true | (all) | 0.0882 | 0.1344 | 0.0807 | 0.6787 |
Because features “weight”, “trans”, and “conv” represent the classifiers MNN, TB, and CNN, respectively, the result table above shows that the CNN classifier (0.0986) is better than both the TB classifier (0.0846) and the MNN classifier (0.0812) in the accuracy metric, but both the MNN classifier and the TB classifier are better than the CNN classifier in the remaining metrics F1, precision, and recall. Although both F1 and precision measure the accuracy aspect of the classification task, they reflect diversity too. Therefore, the CNN classifier is the best in accuracy because the accuracy metric purely measures the ratio of correctly classified images to all images in the entire dataset. The recall metric is intermediate between accuracy and diversity but leans toward diversity, which suggests that the TB classifier, within the context of the transformer, can harmonize the MNN classifier and the CNN classifier, because the recall of the TB classifier (0.7771) lies between the recalls of the CNN classifier (0.4377) and the MNN classifier (0.7966). Moreover, the accuracy of the TB classifier (0.0846) lies between the accuracies of the CNN classifier (0.0986) and the MNN classifier (0.0812). In particular, the TB classifier has the best F1 metric (0.1527) and the best precision metric (0.0903), which suggests that the TB classifier can be the preeminent one in compromising all factors. Although there is significant evidence to support the third hypothesis, which means that it is not necessary to vectorize FEC to obtain better classification, vector FEC (accuracy=0.0882, F1=0.1344, precision=0.0807, recall=0.6787) is always slightly better than matrix FEC (accuracy=0.0880, F1=0.1294, precision=0.0794, recall=0.6623).
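Because the design is balanced, the per-feature and per-vec means in the table of means are simple averages of the six (vec, feature) cell means. The sketch below recomputes them from the cell means as a consistency check; the data layout and function name are illustrative assumptions, and small deviations in the fourth decimal place are expected because the cell means are themselves rounded. It assumes that vec = “true” denotes vector FEC and vec = “false” denotes matrix FEC.

```python
# (vec, feature) -> (accuracy, F1, precision, recall), copied from Table 5.6.
CELL_MEANS = {
    ("false", "conv"): (0.0991, 0.0845, 0.0614, 0.4154),
    ("false", "trans"): (0.0834, 0.1508, 0.0881, 0.7751),
    ("false", "weight"): (0.0816, 0.1530, 0.0887, 0.7963),
    ("true", "conv"): (0.0980, 0.0983, 0.0631, 0.4601),
    ("true", "trans"): (0.0857, 0.1547, 0.0926, 0.7790),
    ("true", "weight"): (0.0809, 0.1502, 0.0863, 0.7970),
}


def marginal_means(cells, axis):
    """Average the cell means over the other factor (axis 0 = vec, 1 = feature)."""
    groups = {}
    for key, values in cells.items():
        groups.setdefault(key[axis], []).append(values)
    return {
        level: tuple(sum(column) / len(column) for column in zip(*rows))
        for level, rows in groups.items()
    }


by_feature = marginal_means(CELL_MEANS, axis=1)
by_vec = marginal_means(CELL_MEANS, axis=0)
print({k: [round(x, 4) for x in v] for k, v in by_feature.items()})
print({k: [round(x, 4) for x in v] for k, v in by_vec.items()})
```

In the recomputed per-vec means, the vec = “true” row is slightly higher than the vec = “false” row in all four metrics.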
Secondly, because group “feature” is the significant factor for the accuracy metric and the F1 metric, it is necessary to run a post-hoc Tukey HSD (Honestly Significant Difference) test on the values “weight”, “trans”, and “conv” of the independent variable “feature” in order to test the first hypothesis and the second hypothesis. The following table shows the post-hoc Tukey HSD results for the feature group within the accuracy metric, which indicate which specific values (“weight”, “trans”, “conv”) differ significantly from one another in the context of treatment “accuracy” (Gemini 2025).
Table 5.7. Post-Hoc Tukey HSD results for feature group within accuracy metric.
| Treatment 1 | Treatment 2 | Mean Difference | Adjusted P-value | Significant |
|---|---|---|---|---|
| conv | trans | -0.0140 | < 0.001 | Yes (conv > trans) |
| conv | weight | -0.0173 | < 0.001 | Yes (conv > weight) |
| trans | weight | -0.0034 | 0.0932 | No |
In the result table above, feature “conv” is significantly larger than features “weight” and “trans” in the accuracy metric because its absolute mean differences with respect to feature “weight” and feature “trans” are large (|-0.0173| and |-0.0140|) and the P-values (< 0.001) are far smaller than 0.05. Because these differences (Treatment 2 minus Treatment 1) are negative, feature “conv” is significantly better than features “weight” and “trans”. Conversely, there is no significant difference between feature “weight” and feature “trans” because the P-value (0.0932) for their mean difference (-0.0034) is larger than 0.05. Therefore, within the accuracy metric, there is significant evidence to reject the first hypothesis “MNN is the alternative of CNN” and there is no significant evidence to support the second hypothesis “attention is the alternative of CNN”. In other words, CNN is better than both MNN and attention within the accuracy metric: MNN should not be the alternative of CNN, and it is not necessary to replace MNN by attention, although attention is slightly better than MNN because the mean difference between feature “weight” and feature “trans” is negative (weight - trans = -0.0034).
The following table shows the post-hoc Tukey HSD results for the feature group within the F1 metric, which indicate which specific values (“weight”, “trans”, “conv”) differ significantly from one another in the context of treatment “F1” (Gemini 2025).
Table 5.8. Post-Hoc Tukey HSD results for feature group within F1 metric.
| Treatment 1 | Treatment 2 | Mean Difference | Adjusted P-value | Significant |
|---|---|---|---|---|
| conv | trans | 0.0613 | < 0.001 | Yes (trans > conv) |
| conv | weight | 0.0602 | < 0.001 | Yes (weight > conv) |
| trans | weight | -0.0011 | 0.9687 | No |
In the result table above, feature “conv” is significantly smaller than features “weight” and “trans” in the F1 metric because its absolute mean differences with respect to feature “weight” and feature “trans” are large (|0.0602| and |0.0613|) and the P-values (< 0.001) are far smaller than 0.05. Because these differences (Treatment 2 minus Treatment 1) are positive, feature “conv” is significantly worse than features “weight” and “trans”. Conversely, there is no significant difference between feature “weight” and feature “trans” because the P-value (0.9687) for their mean difference (-0.0011) is larger than 0.05. Therefore, within the F1 metric, there is significant evidence to support the first hypothesis “MNN is the alternative of CNN” and there is no significant evidence to support the second hypothesis “attention is the alternative of CNN”. In other words, CNN is worse than both MNN and attention within the F1 metric: MNN should be the alternative of CNN, and it is not necessary to replace MNN by attention, although attention is slightly better than MNN because the mean difference between feature “weight” and feature “trans” is negative (weight - trans = -0.0011).
The following table shows the post-hoc Tukey HSD results for the feature group within the precision metric, which indicate which specific values (“weight”, “trans”, “conv”) differ significantly from one another in the context of treatment “precision” (Gemini 2025).
Table 5.9. Post-Hoc Tukey HSD results for feature group within precision metric.
| Treatment 1 | Treatment 2 | Mean Difference | Adjusted P-value | Significant |
|---|---|---|---|---|
| conv | trans | 0.0280 | < 0.001 | Yes (trans > conv) |
| conv | weight | 0.0252 | < 0.001 | Yes (weight > conv) |
| trans | weight | -0.0028 | 0.806 | No |
In the result table above, feature “conv” is significantly smaller than features “weight” and “trans” in the precision metric because its absolute mean differences with respect to feature “weight” and feature “trans” are large (|0.0252| and |0.0280|) and the P-values (< 0.001) are far smaller than 0.05. Because these differences (Treatment 2 minus Treatment 1) are positive, feature “conv” is significantly worse than features “weight” and “trans”. Conversely, there is no significant difference between feature “weight” and feature “trans” because the P-value (0.806) for their mean difference (-0.0028) is larger than 0.05. Therefore, within the precision metric, there is significant evidence to support the first hypothesis “MNN is the alternative of CNN” and there is no significant evidence to support the second hypothesis “attention is the alternative of CNN”. In other words, CNN is worse than both MNN and attention within the precision metric: MNN should be the alternative of CNN, and it is not necessary to replace MNN by attention, although attention is slightly better than MNN because the mean difference between feature “weight” and feature “trans” is negative (weight - trans = -0.0028).
The following table shows the post-hoc Tukey HSD results for the feature group within the recall metric, which indicate which specific values (“weight”, “trans”, “conv”) differ significantly from one another in the context of treatment “recall” (Gemini 2025).
Table 5.10. Post-Hoc Tukey HSD results for feature group within recall metric.
| Treatment 1 | Treatment 2 | Mean Difference | Adjusted P-value | Significant |
|---|---|---|---|---|
| conv | trans | 0.3393 | < 0.001 | Yes (trans > conv) |
| conv | weight | 0.3589 | < 0.001 | Yes (weight > conv) |
| trans | weight | 0.0196 | 0.4481 | No |
In the result table above, feature “conv” is significantly smaller than features “weight” and “trans” in the recall metric because its absolute mean differences with respect to feature “weight” and feature “trans” are large (|0.3589| and |0.3393|) and the P-values (< 0.001) are far smaller than 0.05. Because these differences (Treatment 2 minus Treatment 1) are positive, feature “conv” is significantly worse than features “weight” and “trans”. Conversely, there is no significant difference between feature “weight” and feature “trans” because the P-value (0.4481) for their mean difference (0.0196) is larger than 0.05. Therefore, within the recall metric, there is significant evidence to support the first hypothesis “MNN is the alternative of CNN” and there is no significant evidence to support the second hypothesis “attention is the alternative of CNN”. In other words, CNN is worse than both MNN and attention within the recall metric: MNN should be the alternative of CNN, and it is not necessary to replace MNN by attention, although attention is slightly worse than MNN because the mean difference between feature “weight” and feature “trans” is positive (weight - trans = 0.0196).
In general, the following table summarizes the results of hypothesis testing on the three hypotheses: 1) MNN is the alternative of CNN, 2) attention is the alternative of CNN, and 3) FFN of transformer should be MNN.
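The mean differences in the four Tukey HSD tables can be cross-checked against the per-feature means reported earlier (conv, trans, and weight averaged over both vec values). The sketch below assumes the sign convention "Treatment 2 minus Treatment 1", which is inferred from the reported numbers rather than stated by the tool; the names are illustrative, and the adjusted P-values are not recomputed because they require the studentized range distribution.

```python
# Per-feature means of each metric, copied from the table of means.
FEATURE_MEANS = {
    "accuracy": {"conv": 0.0986, "trans": 0.0846, "weight": 0.0812},
    "F1": {"conv": 0.0914, "trans": 0.1527, "weight": 0.1516},
    "precision": {"conv": 0.0623, "trans": 0.0903, "weight": 0.0875},
    "recall": {"conv": 0.4377, "trans": 0.7771, "weight": 0.7966},
}

# Mean differences as reported in the Tukey HSD tables, in the same pair order.
PAIRS = [("conv", "trans"), ("conv", "weight"), ("trans", "weight")]
REPORTED = {
    "accuracy": [-0.0140, -0.0173, -0.0034],
    "F1": [0.0613, 0.0602, -0.0011],
    "precision": [0.0280, 0.0252, -0.0028],
    "recall": [0.3393, 0.3589, 0.0196],
}


def mean_differences(means):
    """Difference mean(Treatment 2) - mean(Treatment 1) for each pair."""
    return [means[t2] - means[t1] for t1, t2 in PAIRS]


for metric, means in FEATURE_MEANS.items():
    for (t1, t2), diff, rep in zip(PAIRS, mean_differences(means), REPORTED[metric]):
        print(f"{metric:9s} {t2} - {t1}: recomputed {diff:+.4f}, reported {rep:+.4f}")
```

A positive difference therefore means that Treatment 2 has the larger mean, which is how the "Significant" column annotates each pair.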
Table 5.11. Results of hypothesis testing.
| Metric | Hypothesis 1 | Hypothesis 2 | Hypothesis 3 |
|---|---|---|---|
| accuracy | Not supported. CNN is the best. | Not supported. TB is slightly better than MNN but worse than CNN. | Supported. Vector FEC is slightly better than matrix FEC. |
| F1 | Supported. CNN is worse than both TB and MNN. | Not supported. TB is slightly better than both MNN and CNN. | Supported. Vector FEC is slightly better than matrix FEC. |
| precision | Supported. CNN is worse than both TB and MNN. | Not supported. TB is slightly better than both MNN and CNN. | Supported. Vector FEC is slightly better than matrix FEC. |
| recall | Supported. CNN is worse than both TB and MNN. | Not supported. TB is slightly worse than MNN and better than CNN. | Supported. Vector FEC is slightly better than matrix FEC. |
Please note that supporting is not acceptance; supporting is only slightly stronger than not rejecting. The first hypothesis “MNN is the alternative of CNN” is not supported only by the accuracy metric, but accuracy is the most important metric for the accuracy aspect of a classification method, and so the CNN classifier is the best classifier in accuracy. The second hypothesis “attention is the alternative of CNN” is never supported, but TB is slightly better than MNN (except in the recall metric) and better than CNN (except in the accuracy metric). Moreover, the CNN classifier is worse than both the TB classifier and the MNN classifier in the F1 metric, which is the compromise between accuracy and diversity in image classification. This implies that the TB classifier is the best classifier in harmonizing all factors (metrics). The third hypothesis “FFN of transformer should be MNN” is always supported although vector FEC is slightly better than matrix FEC, which implies that it is possible to replace the traditional neural network (vector ANN) by the matrix neural network (MNN), noting that vector ANN consumes much more computational resources due to the boom problem of a huge number of parameters. Moreover, the attention of the transformer can overcome the drawback of vector ANN in accuracy.

6. Conclusions

In general, although it is not asserted that the matrix neural network (MNN) is the alternative of the convolutional neural network (CNN), the transformer is surely a preeminent approach in deep learning for three reasons: 1) its attention is an excellent mechanism for extracting the internal structure, relationships, and features of any kind of high-dimensional data according to self-supervised learning, 2) the transformer differs from the methodology of traditional ANN, which consists of simple layers equipped with parametric weights, because the attention of the transformer concerns both explicit information via the value matrix and implicit information via the query matrix and the key matrix, and 3) ANN and attention are incorporated into the transformer so that deep learning still takes advantage of the excellent aspects of ANN. Again, although it is not asserted that MNN is the alternative of CNN, it is not mandatory to implement the FFN of the transformer by vector ANN. Therefore, it is entirely possible to incorporate MNN and attention into the transformer, noting that MNN is the ANN implemented by parametric weight matrices in order to reduce computational resources significantly without decreasing its accuracy (performance) significantly. The attention mechanism built into the transformer can alleviate the drawback of MNN in performance, especially in the case of huge data. Moreover, for high-dimensional data whose dimension is larger than 2, MNN is a possible technique for processing such tensor-based data.
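The saving in parameters that motivates MNN can be illustrated with a rough count. The sketch below is an assumption-laden illustration, not the layer defined in this research: it assumes one common bilinear form of a matrix layer, Y = U X V^T, with U of size p x m and V of size q x n acting on an m x n input, and compares it with a fully connected layer on the flattened input. Bias terms are ignored, and the function names are hypothetical.

```python
def dense_params(m, n, p, q):
    """Weights of a fully connected layer from a flattened m x n input to p x q outputs."""
    return (m * n) * (p * q)


def bilinear_matrix_params(m, n, p, q):
    """Weights of an assumed bilinear matrix layer Y = U X V^T (U: p x m, V: q x n)."""
    return p * m + q * n


# Example: one 32 x 32 image channel mapped to a 32 x 32 hidden matrix.
m = n = p = q = 32
print(dense_params(m, n, p, q))            # 1048576 weights
print(bilinear_matrix_params(m, n, p, q))  # 2048 weights
```

Under these assumptions the matrix layer needs 2048 weights where the vectorized layer needs 1048576, which is the kind of reduction the boom problem calls for; the actual ratio depends on the layer definition and sizes used in the experiments.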

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.